Preface


Multi-armed bandits have now been studied for nearly a century. While research 
in the beginning was quite meandering, there is now a large community publishing 
hundreds of articles every year. Bandit algorithms are also finding their way into 
practical applications in industry, especially in on-line platforms where data is 
readily available and automation is the only way to scale. 


We had hoped to write a comprehensive book, but the literature is now so vast 
that many topics have been excluded. In the end we settled on the more modest 
goal of equipping our readers with enough expertise to explore the specialised 
literature by themselves, and to adapt existing algorithms to their applications. 
This latter point is important. Problems in theory are all alike; every application is 
different. A practitioner seeking to apply a bandit algorithm needs to understand 
which assumptions in the theory are important and how to modify the algorithm 
when the assumptions change. We hope this book can provide that understanding. 


What is covered in the book is covered in some depth. The focus is on the 
mathematical analysis of algorithms for bandit problems, but this is not a 
traditional mathematics book, where lemmas are followed by proofs, theorems 
and more lemmas. We worked hard to include guiding principles for designing 
algorithms and intuition for their analysis. Many algorithms are accompanied by 
empirical demonstrations that further aid intuition. 


We expect our readers to be familiar with basic analysis and calculus and 
some linear algebra. The book uses the notation of measure-theoretic probability 
theory, but does not rely on any deep results. A dedicated chapter is included to 
introduce the notation and provide intuitions for the basic results we need. This 
chapter is unusual for an introduction to measure theory in that it emphasises the 
reasons to use σ-algebras beyond the standard technical justifications. We hope
this will convince the reader that measure theory is an important and intuitive 
tool. Some chapters use techniques from information theory and convex analysis, 
and we devote a short chapter to each. 


Most chapters are short and should be readable in an afternoon or presented in 
a single lecture. Some components of the book contain content that is not really 
about bandits. These can be skipped by knowledgeable readers, or otherwise 
referred to when necessary. They are marked with a kangaroo symbol because ‘Skippy the
Kangaroo’ skips things.¹ The same mark is used for those parts that contain
useful, but perhaps overly specific information for the first-time reader. Later 
parts will not build on these chapters in any substantial way. Most chapters end 
with a list of notes and exercises. These are intended to deepen intuition and 
highlight the connections between various subsections and the literature. There 
is a table of notation at the end of this preface. 


Thanks 

We’re indebted to our many collaborators and feel privileged that there are 
too many of you to name. The University of Alberta, Indiana University and 
DeepMind have all provided outstanding work environments and supported the 
completion of this book. The book has benefited enormously from the proofreading 
efforts of a large number of our friends and colleagues. We are sorry for all the 
mistakes introduced after your hard work. Alphabetically, they are: Aaditya 
Ramdas, Abbas Mehrabian, Aditya Gopalan, Ambuj Tewari, András Gyorgy, 
Arnoud den Boer, Branislav Kveton, Brendan Patch, Chao Tao, Chao Qin, 
Christoph Dann, Claire Vernade, Emilie Kaufmann, Eugene Ji, Gellért Weisz, 
Gergely Neu, Johannes Kirschner, Julian Zimmert, Kwang-Sung Jun, Lalit Jain, 
Laurent Orseau, Marcus Hutter, Michal Valko, Omar Rivasplata, Pierre Menard, 
Ramana Kumar, Roman Pogodin, Ronald Ortner, Ronan Fruit, Ruihao Zhu, 
Shuai Li, Toshiyuki Tanaka, Wei Chen, Yoan Russac, Yufei Yi and Zhu Xiaohu. 
We are especially grateful to Gabor Balázs and Wouter Koolen, who both read 
almost the entire book. Thanks to Lauren Cowels and Cambridge University 
Press for providing free books for our proofreaders, tolerating the delays and 
for supporting a freely available PDF version. Réka Szepesvari is responsible for 
converting some of our primary school figures to their current glory. Last of all, 
our families have endured endless weekends of editing and multiple false promises 
of ‘done by Christmas’. Rosina and Beata, it really is done now! 


1 Taking inspiration from Tor’s grandfather-in-law, John Dillon [Anderson et al., 1977]. 


Notation 


Some sections are marked with special symbols, which are listed and described 
below. 


[note] This symbol is a note. Usually this is a remark that is slightly tangential to
the topic at hand.

[warning] A warning to the reader.

[important] Something important.

[experiment] An experiment.


Nomenclature and Conventions 

A sequence $(a_n)_{n=1}^\infty$ is increasing if $a_{n+1} \geq a_n$ for all $n \geq 1$ and
decreasing if $a_{n+1} \leq a_n$. When the inequalities are strict, we say strictly
increasing/decreasing. The same terminology holds for functions. We will
not be dogmatic about what is the range of argmin/argmax. Sometimes they
return sets, sometimes arbitrary elements of those sets and, where stated, specific
elements of those sets. We will be specific when it is non-obvious/matters. The
infimum of the empty set is $\inf \emptyset = \infty$ and the supremum is $\sup \emptyset = -\infty$. The
empty sum is $\sum_{i \in \emptyset} a_i = 0$ and the empty product is $\prod_{i \in \emptyset} a_i = 1$.


Landau Notation 

We make frequent use of the Bachmann-Landau notation. Both were nineteenth
century mathematicians who could have never expected their notation to be
adopted so enthusiastically by computer scientists. Given functions $f, g : \mathbb{N} \to [0, \infty)$, define

$$f(n) = O(g(n)) \Leftrightarrow \limsup_{n \to \infty} \frac{f(n)}{g(n)} < \infty,$$

$$f(n) = o(g(n)) \Leftrightarrow \lim_{n \to \infty} \frac{f(n)}{g(n)} = 0,$$

$$f(n) = \Omega(g(n)) \Leftrightarrow \liminf_{n \to \infty} \frac{f(n)}{g(n)} > 0,$$

$$f(n) = \omega(g(n)) \Leftrightarrow \liminf_{n \to \infty} \frac{f(n)}{g(n)} = \infty,$$

$$f(n) = \Theta(g(n)) \Leftrightarrow f(n) = O(g(n)) \text{ and } f(n) = \Omega(g(n)).$$


We make use of the (Bachmann-)Landau notation in two contexts. First, in 
proofs where limiting arguments are made, we sometimes write lower-order terms 
using Landau notation. For example, we might write that $f(n) = \sqrt{n} + o(\sqrt{n})$, by
which we mean that $\lim_{n \to \infty} f(n)/\sqrt{n} = 1$. In this case we use the mathematical
definitions as envisaged by Bachmann and Landau. The second usage is to
informally describe a result without the clutter of uninteresting constants. For
better or worse, this usage is often a little imprecise. For example, we will often
write expressions of the form: $R_n = O(m\sqrt{dn})$. Almost always what is meant
by this is that there exists a universal constant $c > 0$ (a constant that does
not depend on either of the quantities involved) such that $R_n \leq c m \sqrt{dn}$ for all
(reasonable) choices of $m$, $d$ and $n$. In this context we are careful not to use Landau
notation to hide large lower-order terms. For example, if $f(x) = x^2 + 10^{100} x$, we
will not write $f(x) = O(x^2)$, although this would be true.


Bandits
$A_t$                    action in round $t$
$k$                      number of arms/actions
$n$                      time horizon
$X_t$                    reward in round $t$
$Y_t$                    loss in round $t$
$\pi$                    a policy
$\nu$                    a bandit
$\mu_i$                  mean reward of arm $i$

Sets
$\emptyset$              empty set
$\mathbb{N}, \mathbb{N}^+$   natural numbers, $\mathbb{N} = \{0, 1, 2, \ldots\}$ and $\mathbb{N}^+ = \mathbb{N} \setminus \{0\}$
$\mathbb{R}$             real numbers
$\bar{\mathbb{R}}$       $\mathbb{R} \cup \{-\infty, \infty\}$
$[n]$                    $\{1, 2, 3, \ldots, n-1, n\}$
$2^A$                    the power set of set $A$ (the set of all subsets of $A$)
$A^*$                    set of finite sequences over $A$, $A^* = \bigcup_{i=0}^\infty A^i$
$B_2^d$                  $d$-dimensional unit ball, $\{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$
$\mathcal{P}_d$          probability simplex, $\{x \in [0, 1]^{d+1} : \|x\|_1 = 1\}$
$\mathcal{P}(A)$         set of distributions over set $A$
$\mathfrak{B}(A)$        Borel σ-algebra on $A$
$[x, y]$                 convex hull of vectors or real values $x$ and $y$

Functions, Operators and Operations
$|A|$                    the cardinality (number of elements) of the finite set $A$
$(x)^+$                  $\max(x, 0)$
$a \bmod b$              remainder when natural number $a$ is divided by $b$
$\lfloor x \rfloor, \lceil x \rceil$   floor and ceiling functions of $x$
$\operatorname{dom}(f)$  domain of function $f$
$\mathbb{E}$             expectation
$\mathbb{V}$             variance
Supp                     support of distribution or random variable
$\nabla f(x)$            gradient of $f$ at $x$
$\nabla_v f(x)$          directional derivative of $f$ at $x$ in direction $v$
$\nabla^2 f(x)$          Hessian of $f$ at $x$
$\vee, \wedge$           maximum and minimum, $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$
$\operatorname{erf}(x)$  $\frac{2}{\sqrt{\pi}} \int_0^x \exp(-y^2)\,dy$
$\operatorname{erfc}(x)$ $1 - \operatorname{erf}(x)$
$\Gamma(z)$              Gamma function, $\Gamma(z) = \int_0^\infty x^{z-1} \exp(-x)\,dx$
$\phi_A(x)$              support function, $\phi_A(x) = \sup_{y \in A} \langle x, y \rangle$
$f^*(y)$                 convex conjugate, $f^*(y) = \sup_{x \in A} \langle x, y \rangle - f(x)$
$\binom{n}{k}$           binomial coefficient
$\operatorname{argmax}_x f(x)$   maximiser or maximisers of $f$
$\operatorname{argmin}_x f(x)$   minimiser or minimisers of $f$
$\mathbb{I}\{\phi\}$     indicator function: converts Boolean $\phi$ into binary
$\mathbb{I}_B$           indicator of set $B$
$D(P, Q)$                relative entropy between probability distributions $P$ and $Q$
$d(p, q)$                relative entropy between $\mathcal{B}(p)$ and $\mathcal{B}(q)$
Linear Algebra
$e_1, \ldots, e_d$       standard basis vectors of the $d$-dimensional Euclidean space
$\mathbf{0}, \mathbf{1}$ vectors whose elements are all zeros and all ones, respectively
$\det(A)$                determinant of matrix $A$
$\operatorname{trace}(A)$   trace of matrix $A$
$\operatorname{im}(A)$   image of matrix $A$
$\ker(A)$                kernel of matrix $A$
$\operatorname{span}(v_1, \ldots, v_d)$   span of vectors $v_1, \ldots, v_d$
$\lambda_{\min}(G)$      minimum eigenvalue of matrix $G$
$\langle x, y \rangle$   inner product, $\langle x, y \rangle = \sum_i x_i y_i$
$\|x\|_p$                $p$-norm of vector $x$
$\|x\|_G^2$              $x^\top G x$ for positive definite $G \in \mathbb{R}^{d \times d}$ and $x \in \mathbb{R}^d$
$\preceq, \prec$         Loewner partial order of positive semidefinite matrices: $A \preceq B$
                         ($A \prec B$) if $B - A$ is positive semidefinite (respectively, definite)

Distributions
$\mathcal{N}(\mu, \sigma^2)$   Normal distribution with mean $\mu$ and variance $\sigma^2$
$\mathcal{B}(p)$         Bernoulli distribution with mean $p$
$\mathcal{U}(a, b)$      uniform distribution supported on $[a, b]$
$\operatorname{Beta}(\alpha, \beta)$   Beta distribution with parameters $\alpha, \beta > 0$
$\delta_x$               Dirac distribution with point mass at $x$

Topological
$\operatorname{cl}(A)$   closure of set $A$
$\operatorname{int}(A)$  interior of set $A$
$\partial A$             boundary of a set $A$, $\partial A = \operatorname{cl}(A) \setminus \operatorname{int}(A)$
$\operatorname{co}(A)$   convex hull of $A$
$\operatorname{aff}(A)$  affine hull of $A$
$\operatorname{ri}(A)$   relative interior of $A$


Bandits, Probability and Concentration


Introduction 


Bandit problems were introduced by William R. Thompson in an article 
published in 1933 in Biometrika. Thompson was interested in medical 
trials and the cruelty of running a trial blindly, without adapting the 
treatment allocations on the fly as the drug appears more or less effective. 
The name comes from the 1950s, when Frederick Mosteller and Robert Bush decided
to study animal learning and ran trials on mice and then on humans. The mice faced
the dilemma of choosing to go left or right after starting in the bottom of a T-shaped
maze, not knowing each time at which end they would find food. To study a similar
learning setting in humans, a ‘two-armed bandit’ machine was commissioned where
humans could choose to pull either the left or the right arm of the machine, each giving a
random pay-off with the distribution of pay-offs for each arm unknown to the human
player. The machine was called a
‘two-armed bandit’ in homage to the one-armed bandit, an old-fashioned name
for a lever-operated slot machine (‘bandit’ because they steal your money).


Figure 1.1 Mouse learning a T-maze. 


There are many reasons to care about bandit problems. Decision-making with 
uncertainty is a challenge we all face, and bandits provide a simple model of 
this dilemma. Bandit problems also have practical applications. We already 
mentioned clinical trial design, which researchers have used to motivate their 
work for 80 years. We can’t point to an example where bandits have actually 
been used in clinical trials, but adaptive experimental design is gaining popularity 
and is actively encouraged by the US Food and Drug Administration, with the 
justification that not doing so can lead to the withholding of effective drugs until 
long after a positive effect has been established. 


While clinical trials are an important application for the future, there are 
applications where bandit algorithms are already in use. Major tech companies 
use bandit algorithms for configuring web interfaces, where applications include 
news recommendation, dynamic pricing and ad placement. A bandit algorithm
plays a role in Monte Carlo Tree Search, an algorithm made famous by the recent
success of AlphaGo. 

Finally, the mathematical formulation of bandit problems leads to a rich 
structure with connections to other branches of mathematics. In writing this 
book (and previous papers), we have read books on convex analysis/optimisation, 
Brownian motion, probability theory, concentration analysis, statistics, differential 
geometry, information theory, Markov chains, computational complexity and more. 
What fun! 

A combination of all these factors has led to an enormous growth in research 
over the last two decades. Google Scholar reports fewer than 1000, then 2700 and
7000 papers when searching for the phrase ‘bandit algorithm’ for the periods of 
2001-5, 2006-10, and 2011-15, respectively, and the trend just seems to have 
strengthened since then, with 5600 papers coming up for the period of 2016 to 
the middle of 2018. Even if these numbers are somewhat overblown, they are 
indicative of a rapidly growing field. This could be a fashion, or maybe there is 
something interesting happening here. We think that the latter is true. 


A Classical Dilemma 


Imagine you are playing a two-armed bandit machine and you already pulled 
each lever five times, resulting in the following pay-offs (in dollars): 


Round 1 2 3 4 5 6 7 8 9 10 


LEFT 0 10 0 0 10 


RIGHT 10 0 0 0 0 


Figure 1.2 Two-armed bandit

The left arm appears to be doing slightly better. The
average pay-off for this arm is $4, while the average for the
right arm is only $2. Let’s say you have 10 more trials (pulls)
altogether. What is your strategy? Will you keep pulling
the left arm, ignoring the right? Or would you attribute the poor performance of
the right arm to bad luck and try it a few more times? How many more times?
This illustrates one of the main interests in bandit problems. They capture the 
fundamental dilemma a learner faces when choosing between uncertain options. 
Should one explore an option that looks inferior or exploit by going with the 
option that looks best currently? Finding the right balance between exploration 
and exploitation is at the heart of all bandit problems. 
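
The averages quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, the pay-offs are those from the table, and the "greedy" rule at the end is just one possible answer to the question, namely the pure exploitation extreme.

```python
# Pay-offs observed so far (in dollars), copied from the table above.
left_payoffs = [0, 10, 0, 0, 10]
right_payoffs = [10, 0, 0, 0, 0]


def sample_mean(payoffs):
    """Average observed pay-off of one arm."""
    return sum(payoffs) / len(payoffs)


left_mean = sample_mean(left_payoffs)
right_mean = sample_mean(right_payoffs)

# A purely exploiting learner would commit to the arm with the larger average.
greedy_choice = "LEFT" if left_mean >= right_mean else "RIGHT"
print(left_mean, right_mean, greedy_choice)
```

With only five samples per arm, the gap between the two averages is driven by a single lucky pull, which is exactly why committing to the greedy choice is risky.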


1.1 The Language of Bandits


A bandit problem is a sequential game between a learner and an environment.
The game is played over $n$ rounds, where $n$ is a positive natural number called
the horizon. In each round $t \in [n]$, the learner first chooses an action $A_t$ from a
given set $\mathcal{A}$, and the environment then reveals a reward $X_t \in \mathbb{R}$.


In the literature, actions are often also called ‘arms’. We talk about k-armed 
bandits when the number of actions is k, and about multi-armed bandits 
when the number of arms is at least two and the actual number is immaterial 
to the discussion. If there are multi-armed bandits, there are also one-armed 
bandits, which are really two-armed bandits where the pay-off of one of the 
arms is a known fixed deterministic number. 


Of course the learner cannot peek into the future when choosing their
actions, which means that $A_t$ should only depend on the history $H_{t-1} =
(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$. A policy is a mapping from histories to actions. A
learner adopts a policy to interact with an environment. An environment is a
mapping from history sequences ending in actions to rewards. Both the learner
and the environment may randomise their decisions, but this detail is not so
important for now. The most common objective of the learner is to choose actions
that lead to the largest possible cumulative reward over all $n$ rounds, which is
$\sum_{t=1}^n X_t$.
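
The interaction protocol just described can be sketched in code. The following is a minimal illustration rather than a definitive implementation: `interact`, `alternating_policy` and `make_bernoulli_environment` are hypothetical names introduced here, and the particular policy and environment are arbitrary choices made only so that the loop can run.

```python
import random


def interact(policy, environment, n):
    """Play n rounds and return the full history [(A_1, X_1), ..., (A_n, X_n)]."""
    history = []
    for _ in range(n):
        action = policy(history)               # may depend only on the past
        reward = environment(history, action)  # history ending in the new action
        history.append((action, reward))
    return history


def alternating_policy(history):
    """Hypothetical policy: alternate between actions 0 and 1."""
    return len(history) % 2


def make_bernoulli_environment(means, seed=0):
    """Hypothetical environment: action a pays 1 with probability means[a]."""
    rng = random.Random(seed)

    def environment(history, action):
        return 1 if rng.random() < means[action] else 0

    return environment


history = interact(alternating_policy, make_bernoulli_environment([0.2, 0.8]), n=10)
cumulative_reward = sum(reward for _, reward in history)
```

Note that the policy only ever sees the history, never the `means` hidden inside the environment.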

The fundamental challenge in bandit problems is that the environment is 
unknown to the learner. All the learner knows is that the true environment 
lies in some set $\mathcal{E}$ called the environment class. Most of this book is about
designing policies for different kinds of environment classes, though in some cases 
the framework is extended to include side observations as well as actions and 
rewards. 

The next question is how to evaluate a learner. We discuss several performance 
measures throughout the book, but most of our efforts are devoted to 
understanding the regret. There are several ways to define this quantity. To avoid 
getting bogged down in details, we start with a somewhat informal definition. 


DEFINITION 1.1. The regret of the learner relative to a policy $\pi$ (not necessarily
that followed by the learner) is the difference between the total expected reward
using policy $\pi$ for $n$ rounds and the total expected reward collected by the learner
over $n$ rounds. The regret relative to a set of policies $\Pi$ is the maximum regret
relative to any policy $\pi \in \Pi$ in the set.


The set $\Pi$ is often called the competitor class. Another way of saying all this
is that the regret measures the performance of the learner relative to the best
policy in the competitor class. We usually measure the regret relative to a set of
policies $\Pi$ that is large enough to include the optimal policy for all environments
in $\mathcal{E}$. In this case, the regret measures the loss suffered by the learner relative to
the optimal policy.


EXAMPLE 1.2. Suppose the action set is $\mathcal{A} = \{1, 2, \ldots, k\}$. An environment is
called a stochastic Bernoulli bandit if the reward $X_t \in \{0, 1\}$ is binary valued
and there exists a vector $\mu \in [0, 1]^k$ such that the probability that $X_t = 1$ given
the learner chose action $A_t = a$ is $\mu_a$. The class of stochastic Bernoulli bandits is
the set of all such bandits, which are characterised by their mean vectors. If you
knew the mean vector associated with the environment, then the optimal policy
would be to play the fixed action $a^* = \operatorname{argmax}_{a \in \mathcal{A}} \mu_a$. This means that for this problem
the natural competitor class is the set of $k$ constant policies $\Pi = \{\pi_1, \ldots, \pi_k\}$,
where $\pi_i$ chooses action $i$ in every round. The regret over $n$ rounds becomes

$$R_n = n \max_{a \in \mathcal{A}} \mu_a - \mathbb{E}\left[\sum_{t=1}^n X_t\right],$$

where the expectation is with respect to the randomness in the environment and
policy. The first term in this expression is the maximum expected reward using
any policy. The second term is the expected reward collected by the learner.
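
To make the regret in Example 1.2 concrete, the sketch below estimates it by simulation. Everything here is an assumption made for illustration: the function names, the mean vector, the horizon and the two kinds of policy are ours, and a Monte Carlo average stands in for the expectation.

```python
import random


def estimate_regret(means, policy, n, runs=2000, seed=0):
    """Monte Carlo estimate of R_n = n * max_a mu_a - E[sum_t X_t]."""
    rng = random.Random(seed)
    total_reward = 0
    for _ in range(runs):
        history = []
        for _ in range(n):
            action = policy(history, rng)
            reward = 1 if rng.random() < means[action] else 0
            history.append((action, reward))
            total_reward += reward
    return n * max(means) - total_reward / runs


def constant_policy(action):
    """The constant policy from the example: the same action in every round."""
    return lambda history, rng: action


def uniform_policy(k):
    """A naive learner that ignores the history and picks an action at random."""
    return lambda history, rng: rng.randrange(k)


means = [0.5, 0.6]  # hypothetical mean vector mu (actions indexed from 0 here)
n = 100
regret_best = estimate_regret(means, constant_policy(1), n)    # exact value is 0
regret_worst = estimate_regret(means, constant_policy(0), n)   # exact value is n * 0.1
regret_uniform = estimate_regret(means, uniform_policy(2), n)  # exact value is n * 0.05
```

The estimates fluctuate around the exact values given in the comments, because the expectation is replaced by an average over a finite number of simulated runs.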


For a fixed policy and competitor class, the regret depends on the environment. 
The environments where the regret is large are those where the learner is behaving 
worse. Of course the ideal case is that the regret be small for all environments. 
The worst-case regret is the maximum regret over all possible environments. 

One of the core questions in the study of bandits is to understand the growth
rate of the regret as $n$ grows. A good learner achieves sublinear regret. Letting $R_n$
denote the regret over $n$ rounds, this means that $R_n = o(n)$ or equivalently that
$\lim_{n \to \infty} R_n / n = 0$. Of course one can ask for more. Under what circumstances is
$R_n = O(\sqrt{n})$ or $R_n = O(\log(n))$? And what are the leading constants? How does
the regret depend on the specific environment in which the learner finds itself?
We will discover eventually that for the environment class in Example 1.2, the
worst-case regret for any policy is at least $\Omega(\sqrt{n})$ and that there exist policies for
which $R_n = O(\sqrt{n})$.
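
As a small worked example of what sublinear regret rules out (the example is ours, not taken from the text), consider a learner in the Bernoulli setting of Example 1.2 that ignores all feedback and chooses each of the $k$ actions with probability $1/k$ in every round. Its expected reward per round is the average of the means, so

```latex
R_n = n \max_{a \in \mathcal{A}} \mu_a - \mathbb{E}\left[\sum_{t=1}^n X_t\right]
    = n \left( \max_{a \in \mathcal{A}} \mu_a - \frac{1}{k} \sum_{a=1}^k \mu_a \right).
```

Unless all arms have the same mean, the bracketed gap is a positive constant, so $R_n / n$ does not tend to zero and the regret is linear in $n$. Any learner worth the name must therefore use the observed rewards to do better than this.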


A large environment class corresponds to less knowledge by the learner. A 
large competitor class means the regret is a more demanding criterion. Some
care is sometimes required to choose these sets appropriately so that (a) 
guarantees on the regret are meaningful and (b) there exist policies that 
make the regret small. 


The framework is general enough to model almost anything by using a rich 
enough environment class. This cannot be bad, but with too much generality it 
becomes impossible to say much. For this reason, we usually restrict our attention 
to certain kinds of environment classes and competitor classes. 

A simple problem setting is that of stochastic stationary bandits. In this
case the environment is restricted to generate the reward in response to each action
from a distribution that is specific to that action and independent of the previous
action choices and rewards. The environment class in Example 1.2 satisfies these
conditions, but there are many alternatives. For example, the rewards could follow
a Gaussian distribution rather than Bernoulli. This relatively mild difference does
not change the nature of the problem in a significant way. A more drastic change
is to assume the action set $\mathcal{A}$ is a subset of $\mathbb{R}^d$ and that the mean reward for
choosing some action $a \in \mathcal{A}$ follows a linear model, $X_t = \langle a, \theta \rangle + \eta_t$ for $\theta \in \mathbb{R}^d$
and $\eta_t$ a standard Gaussian (zero mean, unit variance). The unknown quantity
in this case is $\theta$, and the environment class corresponds to its possible values
($\mathcal{E} = \mathbb{R}^d$).
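
The linear model can be illustrated with a short sketch. The parameter vector, the action set and the function name `linear_reward` below are hypothetical values chosen for the illustration; only the form of the reward, an inner product plus standard Gaussian noise, comes from the text.

```python
import random


def linear_reward(action, theta, rng):
    """Sample a reward <a, theta> + eta, with eta standard Gaussian noise."""
    mean = sum(a_i * theta_i for a_i, theta_i in zip(action, theta))
    return mean + rng.gauss(0.0, 1.0)


rng = random.Random(0)
theta = [1.0, -0.5]                             # hypothetical unknown parameter
actions = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]  # hypothetical action set in R^2

# Mean rewards are linear in the action, so one parameter vector describes all arms.
mean_rewards = [sum(a_i * t_i for a_i, t_i in zip(a, theta)) for a in actions]
samples = [linear_reward(actions[0], theta, rng) for _ in range(5)]
```

The point of the structure is that rewards observed for one action carry information about $\theta$, and hence about the mean reward of every other action.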

For some applications, the assumption that the rewards are stochastic and 
stationary may be too restrictive. The world mostly appears deterministic, even 
if it is hard to predict and often chaotic looking. Of course, stochasticity has 
been enormously successful in explaining patterns in data, and this may be 
sufficient reason to keep it as the modelling assumption. But what if the stochastic 
assumptions fail to hold? What if they are violated for a single round? Or just for 
one action, at some rounds? Will our best algorithms suddenly perform poorly? 
Or will the algorithms developed be robust to smaller or larger deviations from 
the modelling assumptions? 

An extreme idea is to drop all assumptions on how the rewards are generated, 
except that they are chosen without knowledge of the learner’s actions and lie 
in a bounded set. If these are the only assumptions, we get what is called the 
setting of adversarial bandits. The trick to say something meaningful in this 
setting is to restrict the competitor class. The learner is not expected to find 
the best sequence of actions, which may be like finding a needle in a haystack. 
Instead, we usually choose $\Pi$ to be the set of constant policies and demand that
the learner is not much worse than any of these. By defining the regret in this 
way, the stationarity assumption is transported into the definition of regret rather 
than constraining the environment. 
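
The following sketch shows how regret relative to the constant policies can be computed once the rewards are fixed. It is an illustration under our own assumptions: the reward table is made up, the function name is hypothetical, and the regret is computed for one realised sequence of actions (for a randomised learner one would take an expectation).

```python
def regret_against_constant_policies(reward_table, chosen_actions):
    """Regret of an action sequence relative to the best fixed action in hindsight.

    reward_table[t][a] is the reward the environment assigned to action a in
    round t; the whole table is fixed without knowledge of the learner's actions.
    """
    k = len(reward_table[0])
    collected = sum(row[a] for row, a in zip(reward_table, chosen_actions))
    best_constant = max(sum(row[a] for row in reward_table) for a in range(k))
    return best_constant - collected


# Hypothetical rewards in [0, 1] for k = 2 actions over n = 4 rounds.
reward_table = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 0.0],
]
regret = regret_against_constant_policies(reward_table, [1, 0, 0, 1])
```

Here the best constant action collects 3, while the best sequence of actions would collect 4. The learner is only compared with the former, which is what makes the problem tractable.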

Of course there are all shades of grey between these two extremes. Sometimes 
we consider the case where the rewards are stochastic, but not stationary. Or 
one may analyse the robustness of an algorithm for stochastic bandits to small 
adversarial perturbations. Another idea is to isolate exactly which properties of 
the stochastic assumption are really exploited by a policy designed for stochastic 
bandits. This kind of inverse analysis can help explain the strong performance of 
policies when facing environments that clearly violate the assumptions they were 
designed for. 


Other Learning Objectives 


We already mentioned that the regret can be defined in several ways, each 
capturing slightly different aspects of the behaviour of a policy. Because the 
regret depends on the environment, it becomes a multi-objective criterion: ideally,
we want to keep the regret small across all possible environments. One way to
convert a multi-objective criterion into a single number is to take averages. This 
corresponds to the Bayesian viewpoint where the objective is to minimise the 
average cumulative regret with respect to a prior on the environment class. 
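
The difference between averaging and taking the worst case can be seen in a toy calculation. The environments, regrets and prior below are invented for the illustration, and the function names are ours.

```python
def worst_case_regret(regret_by_env):
    """Largest regret over the environments considered."""
    return max(regret_by_env.values())


def bayesian_regret(regret_by_env, prior):
    """Prior-weighted average regret; prior maps environments to probabilities."""
    return sum(prior[env] * regret for env, regret in regret_by_env.items())


# Hypothetical regrets of one policy on three environments, and a hypothetical prior.
regret_by_env = {"env_a": 2.0, "env_b": 10.0, "env_c": 4.0}
prior = {"env_a": 0.5, "env_b": 0.25, "env_c": 0.25}

print(worst_case_regret(regret_by_env))       # 10.0
print(bayesian_regret(regret_by_env, prior))  # 4.5
```

A policy can look good on average while performing poorly on an environment to which the prior assigns little weight, so the two criteria may rank policies differently.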

Maximising the sum of rewards is not always the objective. Sometimes the 
learner just wants to find a near-optimal policy after n rounds, but the actual 
rewards accumulated over those rounds are unimportant. We will see examples 
of this shortly. 


Limitations of the Bandit Framework 


One of the distinguishing features of all bandit problems studied in this book 
is that the learner never needs to plan for the future. More precisely, we will 
invariably make the assumption that the learner’s available choices and rewards 
tomorrow are not affected by their decisions today. Problems that do require 
this kind of long-term planning fall into the realm of reinforcement learning, 
which is the topic of the final chapter. Another limitation of the bandit framework 
is the assumption that the learner observes the reward in every round. The setting 
where the reward is not observed is called partial monitoring and is the topic 
of Chapter 37. Finally, often, the environment itself consists of strategic agents, 
which the learner needs to take into account. This problem is studied in game 
theory and would need a book on its own. 


Applications 


After this short preview, and as an appetiser before the hard work, we briefly 
describe the formalisations of a variety of applications. 


A/B Testing 

The designers of a company website are trying to decide whether the ‘buy it now’
button should be placed at the top of the product page or at the bottom. In
the old days, they would commit to a trial of each version by splitting incoming 
users into two groups of 10000. Each group would be shown a different version 
of the site, and a statistician would examine the data at the end to decide which 
version was better. One problem with this approach is the non-adaptivity of the 
test. For example, if the effect size is large, then the trial could be stopped early. 




One way to apply bandits to this problem is to view the two versions of the
site as actions. Each time $t$ a user makes a request, a bandit algorithm is used
to choose an action $A_t \in \mathcal{A} = \{\text{SITEA}, \text{SITEB}\}$, and the reward is $X_t = 1$ if the
user purchases the product and $X_t = 0$ otherwise.

In traditional A/B testing, the objective of the statistician is to decide which
website is better. When using a bandit algorithm, there is no need to end
the trial. The algorithm automatically decides when one version of the site
should be shown more often than another. Even if the real objective is to
identify the best site, adaptivity or early stopping can be added to the
A/B process using techniques from bandit theory. While this is not the focus
of this book, some of the basic ideas are explained in Chapter 33.
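
To see what an adaptive allocation might look like, here is a minimal simulation. The allocation rule is epsilon-greedy, chosen here purely for brevity and not because the text prescribes it. The function name, the purchase probabilities and the value of `epsilon` are all hypothetical.

```python
import random


def run_ab_bandit(purchase_prob, n_users, epsilon=0.1, seed=0):
    """Epsilon-greedy allocation of users to two site versions.

    purchase_prob maps each version to a purchase probability that is
    hidden from the algorithm and used here only to simulate users.
    """
    rng = random.Random(seed)
    versions = list(purchase_prob)
    shown = {v: 0 for v in versions}
    purchases = {v: 0 for v in versions}

    def empirical_rate(v):
        return purchases[v] / shown[v] if shown[v] else 0.0

    for _ in range(n_users):
        if rng.random() < epsilon or 0 in shown.values():
            version = rng.choice(versions)               # explore
        else:
            version = max(versions, key=empirical_rate)  # exploit
        reward = 1 if rng.random() < purchase_prob[version] else 0
        shown[version] += 1
        purchases[version] += reward
    return shown, purchases


shown, purchases = run_ab_bandit({"SITEA": 0.05, "SITEB": 0.10}, n_users=20000)
```

Unlike the fixed split into two groups of 10000, the number of users sent to each version is not decided in advance but depends on the purchases observed so far.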


Advert Placement 

In advert placement, each round corresponds to a user visiting a website, and
the set of actions $\mathcal{A}$ is the set of all available adverts. One could treat this as
a standard multi-armed bandit problem, where in each round a policy chooses
$A_t \in \mathcal{A}$, and the reward is $X_t = 1$ if the user clicked on the advert and $X_t = 0$
otherwise. This might work for specialised websites where the adverts are all
likely to be appropriate. But for a company like Amazon, the advertising should 
be targeted. A user that recently purchased rock-climbing shoes is much more 
likely to buy a harness than another user. Clearly an algorithm should take this 
into account. 

The standard way to incorporate this additional knowledge is to use the 
information about the user as context. In its simplest formulation, this might 
mean clustering users and implementing a separate bandit algorithm for each 
cluster. Much of this book is devoted to the question of how to use side information 
to improve the performance of a learner. 
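
The simplest formulation, a separate bandit per cluster, mostly comes down to bookkeeping. The sketch below keeps click statistics for every pair of cluster and advert. The class name, the cluster labels and the adverts are hypothetical, and the greedy selection rule is only a placeholder for whatever bandit algorithm one would actually run within each cluster.

```python
from collections import defaultdict


class ClusteredClickStats:
    """Separate click statistics for every (user cluster, advert) pair."""

    def __init__(self, adverts):
        self.adverts = list(adverts)
        self.shown = defaultdict(int)
        self.clicks = defaultdict(int)

    def update(self, cluster, advert, clicked):
        self.shown[(cluster, advert)] += 1
        self.clicks[(cluster, advert)] += int(clicked)

    def click_rate(self, cluster, advert):
        shown = self.shown[(cluster, advert)]
        return self.clicks[(cluster, advert)] / shown if shown else 0.0

    def greedy_advert(self, cluster):
        """Advert with the best empirical click rate within this cluster."""
        return max(self.adverts, key=lambda ad: self.click_rate(cluster, ad))


stats = ClusteredClickStats(["harness", "novel"])
stats.update("climbers", "harness", clicked=True)
stats.update("climbers", "novel", clicked=False)
stats.update("readers", "harness", clicked=False)
stats.update("readers", "novel", clicked=True)
```

Because nothing is shared between clusters, data from one group of users never helps with another. This is the price of the simple formulation.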

This is a good place to emphasise that the world is messy. The set of available 
adverts is changing from round to round. The feedback from the user can be 
delayed for many rounds. Finally, the real objective is rarely just to maximise 
clicks. Other metrics such as user satisfaction, diversity, freshness and fairness, 
just to mention a few, are important too. These are the kinds of issues that make 
implementing bandit algorithms in the real world a challenge. This book will not 
address all these issues in detail. Instead we focus on the foundations and hope 
this provides enough understanding that you can invent solutions for whatever 
peculiar challenges arise in your problem. 


Recommendation Services 

Netflix has to decide which movies to place most prominently in your ‘Browse’ 
page. As in advert placement, users arrive at the page sequentially, and the
reward can be measured as some function of (a) whether or not you watched a 
movie and (b) whether or not you rated it positively. There are many challenges. 
First of all, Netflix shows a long list of movies, so the set of possible actions 
is combinatorially large. Second, each user watches relatively few movies, and 
individual users are different. This suggests approaches such as low-rank matrix 
factorisation (a popular approach in ‘collaborative filtering’). But notice this is
not an offline problem. The learning algorithm gets to choose what users see and
this affects the data. If the users are never recommended the AlphaGo movie, 
then few users will watch it, and the amount of data about this film will be 
scarce. 


Network Routing 

Another problem with an interesting structure is network routing, where the 
learner tries to direct internet traffic through the shortest path on a network. In 
each round the learner receives the start/end destinations for a packet of data. 
The set of actions is the set of all paths starting and ending at the appropriate 
points on some known graph. The feedback in this case is the time it takes for 
the packet to be received at its destination, and the reward is the negation of 
this value. Again the action set is combinatorially large. Even relatively small 
graphs have an enormous number of paths. The routing problem can obviously 
be applied to more physical networks such as transportation systems used in 
operations research. 
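
The claim that small graphs already have an enormous number of paths is easy to check by brute force. The sketch below counts the paths that never revisit a node between opposite corners of a square grid graph. The choice of a grid and the function name are our own illustrative assumptions.

```python
def count_simple_paths(size):
    """Count self-avoiding paths between opposite corners of a size x size grid graph."""
    target = (size - 1, size - 1)

    def extend(node, visited):
        if node == target:
            return 1
        total = 0
        row, col = node
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited)
                visited.remove(nxt)
        return total

    return extend((0, 0), {(0, 0)})


path_counts = {size: count_simple_paths(size) for size in range(2, 6)}
print(path_counts)
```

The counts grow so quickly with the grid size that treating every path as an independent arm is hopeless, which is why the structure of the action set has to be exploited.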


Dynamic Pricing 

In dynamic pricing, a company is trying to automatically optimise the price of 
some product. Users arrive sequentially, and the learner sets the price. The user 
will only purchase the product if the price is lower than their valuation. What 
makes this problem interesting is (a) the learner never actually observes the 
valuation of the product, only the binary signal that the price was too low/too 
high, and (b) there is a monotonicity structure in the pricing. If a user purchased 
an item priced at $10, then they would surely purchase it for $5, but whether or 
not it would sell when priced at $11 is uncertain. Also, the set of possible actions 
is close to continuous. 
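
The feedback structure can be made explicit in a few lines. In this sketch the function name is hypothetical, and taking the revenue as the reward is our assumption, since the text does not fix the reward. The valuation appears as an argument only so that the user can be simulated; the learner sees the returned pair and nothing else.

```python
def pricing_feedback(price, valuation):
    """Return (sold, revenue) for one user; the valuation itself stays hidden."""
    sold = price < valuation
    return sold, (price if sold else 0.0)


valuation = 10.5  # hypothetical valuation, never revealed to the learner
outcomes = [pricing_feedback(price, valuation)[0] for price in (5.0, 10.0, 11.0)]
print(outcomes)  # [True, True, False]
```

The printed pattern shows the monotonicity: once a price is low enough to sell, every lower price sells as well.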


Waiting Problems 

Every day you travel to work, either by bus or by walking. Once you get on the 
bus, the trip only takes 5 minutes, but the timetable is unreliable, and the bus 
arrival time is unknown and stochastic. Sometimes the bus doesn’t come at all. 
Walking, on the other hand, takes 30 minutes along a beautiful river away from 
the road. The problem is to devise a policy for choosing how long to wait at 
the bus stop before giving up and walking to minimise the time to get to your 
workplace. Walk too soon, and you miss the bus and gain little information. But 
waiting too long also comes at a price. 
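
The cost and the information gained under a given waiting rule can be written down directly. This is a sketch under simplifying assumptions of ours: the policy is a fixed waiting limit in minutes, and `commute_time` and its parameter names are hypothetical. The 5-minute ride and 30-minute walk are taken from the description above.

```python
def commute_time(wait_limit, bus_arrival, bus_ride=5.0, walk=30.0):
    """Total minutes to reach work when waiting at most wait_limit for the bus.

    bus_arrival is the arrival time of the bus in minutes after reaching the
    stop, or None on days when the bus never comes. Returns the commute time
    together with the observed arrival time (None if it was not observed).
    """
    if bus_arrival is not None and bus_arrival <= wait_limit:
        return bus_arrival + bus_ride, bus_arrival  # arrival time observed
    return wait_limit + walk, None                  # only learn: no bus within wait_limit


print(commute_time(10, 4))     # bus caught after 4 minutes
print(commute_time(10, 20))    # gave up after 10 minutes and walked
print(commute_time(10, None))  # the bus never came
```

On the days when the learner walks, the second and third cases are indistinguishable, which is the sense in which a short waiting limit yields little information.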

While waiting for a bus is not a problem we all face, there are other applications 
of this setting. For example, deciding the amount of inactivity required before 
putting a hard drive into sleep mode or powering off a car engine at traffic lights. 
The statistical part of the waiting problem concerns estimating the cumulative 
distribution function of the bus arrival times from data. The twist is that the 
data is censored on the days you chose to walk before the bus arrived, which 
is a problem analysed in the subfield of statistics called survival analysis. The
interplay between the statistical estimation problem and the challenge of balancing
exploration and exploitation is what makes this and the other problems studied 
in this book interesting. 


Resource Allocation 

A large part of operations research is focussed on designing strategies for allocating 
scarce resources. When the dynamics of demand or supply are uncertain, the 
problem has elements reminiscent of a bandit problem. Allocating too few 
resources reveals only partial information about the true demand, but allocating 
too many resources is wasteful. Of course, resource allocation is broad, and many 
problems exhibit structure that is not typical of bandit problems, like the need 
for long-term planning. 


Tree Search 

The UCT algorithm is a tree search algorithm commonly used in perfect- 
information game-playing algorithms. The idea is to iteratively build a search 
tree where in each iteration the algorithm takes three steps: (1) chooses a path 
from the root to a leaf; (2) expands the leaf (if possible); (3) performs a Monte 
Carlo roll-out to the end of the game. The contribution of a bandit algorithm is in 
selecting the path from the root to the leaves. At each node in the tree, a bandit 
algorithm is used to select the child based on the series of rewards observed 
through that node so far. The resulting algorithm can be analysed theoretically, 
but more importantly has demonstrated outstanding empirical performance in 
game-playing problems. 


Notes 


1 The reader may find it odd that at one point we identified environments with 
maps from histories to rewards, while we used the language that a learner 
‘adopts a policy’ (a map from histories to actions). The reason is partly historical 
and partly because policies and their design are at the center of the book, while 
the environment strategies will mostly be kept fixed (and relatively simple). 
On this note, strategy is also a word that is sometimes used interchangeably with 
policy. 


Bibliographic Remarks 


As we mentioned in the very beginning, the first paper on bandits was by 
Thompson [1933]. The experimentation on mice and humans that led to the 
name comes from the paper by Bush and Mosteller [1953]. Much credit for the 
popularisation of the field must go to the famous mathematician and statistician, 
Herbert Robbins, whose name appears on many of the works that we reference, 
with the earliest being [Robbins, 1952]. Another early pioneer is Herman Chernoff, 
who wrote papers with titles like ‘Sequential Decisions in the Control of a 
Spaceship’ [Bather and Chernoff, 1967]. 

Besides these seminal papers, there are already a number of books on bandits 
that may serve as useful additional reading. The most recent (and also most 
related) is by Bubeck and Cesa-Bianchi [2012] and is freely available online. This is 
an excellent book and is warmly recommended. The main difference between their 
book and ours is that (a) we have the benefit of seven years of additional research 
in a fast-moving field and (b) our longer page limit permits more depth. Another 
relatively recent book is Prediction, Learning and Games by Cesa-Bianchi and 
Lugosi [2006]. This is a wonderful book, and quite comprehensive. But its scope 
is ‘all of’ online learning, which is so broad that bandits are not covered in great 
depth. We should mention there is also a recent book on bandits by Slivkins 
[2019]. Conveniently it covers some topics not covered in this book (notably 
Lipschitz bandits and bandits with knapsacks). The reverse is also true, which 
should not be surprising since our book is currently 400 pages longer. There are 
also four books on sequential design and multi-armed bandits in the Bayesian 
setting, which we will address only a little. These are based on relatively old 
material, but are still useful references for this line of work and are well worth 
reading [Chernoff, 1959, Berry and Fristedt, 1985, Presman and Sonin, 1990, 
Gittins et al., 2011]. 

Without trying to be exhaustive, here are a few articles applying bandit 
algorithms; a recent survey is by Bouneffouf and Rish [2019]. The papers 
themselves will contain more useful pointers to the vast literature. We mentioned 
AlphaGo already [Silver et al., 2016]. The tree search algorithm that drives its 
search uses a bandit algorithm at each node [Kocsis and Szepesvari, 2006]. Le et al. 
[2014] apply bandits to wireless monitoring, where the problem is challenging 
due to the large action space. Lei et al. [2017] design specialised contextual 
bandit algorithms for just-in-time adaptive interventions in mobile health: in 
the typical application the user is prompted with the intention of inducing a 
long-term beneficial behavioural change. See also the article by Greenewald et al. 
[2017]. Rafferty et al. [2018] apply Thompson sampling to educational software 
and note the trade-off between knowledge and reward. Sadly, as of 2015, bandit 
algorithms had still not been used in clinical trials, as explicitly mentioned 
by Villar et al. [2015]. Microsoft offers a ‘Decision Service’ that uses bandit 
algorithms to automate decision-making [Agarwal et al., 2016]. 


2.1 


Foundations of Probability (Æ) 


This chapter covers the fundamental concepts of measure-theoretic probability, 
on which the remainder of this book relies. Readers familiar with this topic can 
safely skip the chapter, but perhaps a brief reading would yield some refreshing 
perspectives. Measure-theoretic probability is often viewed as a necessary evil, 
to be used when a demand for rigour combined with continuous spaces breaks 
the simple approach we know and love from high school. We claim that measure- 
theoretic probability offers more than annoying technical machinery. In this 
chapter we attempt to prove this by providing a non-standard introduction. 
Rather than a long list of definitions, we demonstrate the intuitive power of 
the notation and tools. For those readers with little prior experience in measure 
theory this chapter will no doubt be a challenging read. We think the investment 
is worth the effort, but a great deal of the book can be read without it, provided 
one is willing to take certain results on faith. 


Probability Spaces and Random Elements 


The thrill of gambling comes from the fact that the bet is placed on future 
outcomes that are uncertain at the time of the gamble. A central question in 
gambling is the fair value of a game. This can be difficult to answer for all but 
the simplest games. As an illustrative example, imagine the following moderately 
complex game: I throw a dice. If the result is four, I throw two more dice; otherwise 
I throw one dice only. Looking at each newly thrown dice (one or two), I repeat 
the same, for a total of three rounds. Afterwards, I pay you the sum of the values 
on the faces of the dice. How much are you willing to pay to play this game with 
me? 

Many examples of practical interest exhibit a complex random interdependency 
between outcomes. The cornerstone of modern probability as proposed by 
Kolmogorov aims to remove this complexity by separating the randomness from 
the mechanism that produces the outcome. 

Instead of rolling the dice one by one, imagine that sufficiently many dice were 
rolled before the game has even started. For our game we need to roll seven 
dice, because this is the maximum number that might be required (one in the 
first round, two in the second round and four in the third round; see Fig. 2.1). 

Figure 2.1 The initial phase of a gambling game with a random number of dice rolls. 
Depending on the outcome of a dice roll, one or two dice are rolled for a total of three 
rounds. The number of dice used will then be random in the range of three to seven. 


After all the dice are rolled, the game can be emulated by ordering the dice and 
revealing the outcomes sequentially. Then the value of the first dice in the chosen 
ordering is the outcome of the dice in the first round. If we see a four, we look at 
the next two dice in the ordering; otherwise we look at the single next dice. 


By taking this approach, we get a simple calculus for the probabilities of all 
kinds of events. Rather than directly calculating the likelihood of each pay-off, 
we first consider the probability of any single outcome of the dice. Since there 
are seven dice, the set of all possible outcomes is $\Omega = \{1,\dots,6\}^7$. Because 
all outcomes are equally probable, the probability of any $\omega \in \Omega$ is $(1/6)^7$. The 
probability of the game pay-off taking value $v$ can then be evaluated by calculating 
the total probability assigned to all those outcomes $\omega \in \Omega$ that would result 
in the value of $v$. In principle, this is trivial to do thanks to the separation of 
everything that is probabilistic from the rest. The set $\Omega$ is called the outcome 
space, and its elements are the outcomes. Fig. 2.2 illustrates this idea. Random 
outcomes are generated on the left, while on the right, various mechanisms are 
used to arrive at values; some of these values may be observed and some not. 


There will be much benefit from being a little more formal about how we 
come up with the value of our artificial game. For this, note that the process by 
which the game gets its value is a function $X$ that maps $\Omega$ to the reals (simply, 
$X : \Omega \to \mathbb{R}$). We find it ironic that functions of this type (from the outcome 
space to subsets of the reals) are called random variables. They are neither 
random nor variables in a programming language sense. The randomness is in 
the argument that X is acting on, producing randomly changing results. Later 
we will put a little more structure on random variables, but for now it suffices to 
think of them as maps from the outcome space to the reals. 
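
The separation of randomness from mechanism can be made concrete with a short sketch. It assumes the dice are revealed in the order they were pre-rolled and that the pay-off is the sum of all revealed dice, as in the game described above; the names `payoff`, `law` and `fair_value` are illustrative and not from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def payoff(omega):
    """Mechanism X: map seven pre-rolled dice to the pay-off of the game."""
    current = [omega[0]]   # dice revealed in the current round
    used = 1               # how many pre-rolled dice have been revealed so far
    total = omega[0]
    for _ in range(2):     # rounds two and three
        # every four in the current round triggers two new dice, anything else one
        need = sum(2 if d == 4 else 1 for d in current)
        current = list(omega[used:used + need])
        used += need
        total += sum(current)
    return total


# Enumerate the outcome space {1,...,6}^7; each outcome has probability (1/6)^7.
counts = Counter(payoff(omega) for omega in product(range(1, 7), repeat=7))
law = {v: Fraction(c, 6 ** 7) for v, c in sorted(counts.items())}
fair_value = sum(v * p for v, p in law.items())
print(fair_value, float(fair_value))
```

Here `payoff` is deterministic; all randomness sits in the choice of `omega`. The expected pay-off can be cross-checked by hand: the expected number of dice revealed is $1 + 7/6 + 49/36$ and each shows $3.5$ on average, giving $889/72 \approx 12.35$.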


[Diagram for Figure 2.2: on the left, a randomising device (= all randomness) on $(\Omega, \mathcal{F})$ generates outcomes; on the right, mechanisms turn the outcomes into values.] 


Figure 2.2 A key idea in probability theory is the separation of sources of randomness 
from game mechanisms. A mechanism creates values from the elementary random 
outcomes, some of which are visible for observers, while others may remain hidden. 


We follow the standard convention in probability theory where random 
variables are denoted by capital letters. Be warned that capital letters are 
also used for other purposes as demanded by different conventions. 


Pick some number $v \in \mathbb{N}$. What is the probability of seeing $X = v$? As 
described above, this probability is $(1/6)^7$ times the size of the set $X^{-1}(v) = 
\{\omega \in \Omega : X(\omega) = v\}$. The set $X^{-1}(v)$ is called the preimage of $v$ under $X$. More 
generally, the probability that $X$ takes its value in some set $A \subset \mathbb{N}$ is given by 
$(1/6)^7$ times the cardinality of $X^{-1}(A) = \{\omega \in \Omega : X(\omega) \in A\}$, where we have 
overloaded the definition of $X^{-1}$ to set-valued inputs. 

Notice in the previous paragraph we only needed probabilities assigned to 
subsets of $\Omega$, regardless of the question asked. To make this a bit more general, 
let us introduce a map $\mathbb{P}$ that assigns probabilities to certain subsets of $\Omega$. The 
intuitive meaning of $\mathbb{P}$ is as follows. Random outcomes are generated in $\Omega$. The 
probability that an outcome falls into a set $A \subset \Omega$ is $\mathbb{P}(A)$. If $A$ is not in the 
domain of $\mathbb{P}$, then there is no answer to the question of the probability of the 
outcome falling in $A$. But let’s postpone until later the discussion of why $\mathbb{P}$ should be 
restricted to only certain subsets of $\Omega$. In the above example with the dice, 
the set of subsets in the domain of $\mathbb{P}$ is not restricted and, in particular, for any 
subset $A \subset \Omega$, $\mathbb{P}(A) = (1/6)^7 |A|$. 

The probability of seeing $X$ taking the value of $v$ is thus $\mathbb{P}(X^{-1}(v))$. To 
minimise clutter, the more readable notation for this is $\mathbb{P}(X = v)$. But always 
keep in mind that this familiar form is just a shorthand for $\mathbb{P}(X^{-1}(v))$. More 
generally, we also use 

$$\mathbb{P}(\operatorname{predicate}(U, V, \dots)) = \mathbb{P}(\{\omega \in \Omega : \operatorname{predicate}(U(\omega), V(\omega), \dots) \text{ is true}\})$$

with any predicate (an expression evaluating to true or false) where $U, V, \dots$ are 
functions with domain $\Omega$. 

What properties should $\mathbb{P}$ satisfy? Since $\Omega$ is the set of all possible outcomes, 
it seems reasonable to expect that $\mathbb{P}$ is defined for $\Omega$ and $\mathbb{P}(\Omega) = 1$, and since $\emptyset$ 
contains no outcomes, $\mathbb{P}(\emptyset) = 0$ is also expected to hold. Furthermore, probabilities 
should be non-negative so $\mathbb{P}(A) \ge 0$ for any $A \subset \Omega$ on which $\mathbb{P}$ is defined. Let 
$A^c = \Omega \setminus A$ be the complement of $A$. Then we should expect that $\mathbb{P}$ is defined 
for $A$ exactly when it is defined for $A^c$ and $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$ (negation rule). 
Finally, if $A, B$ are disjoint so that $A \cap B = \emptyset$ and $\mathbb{P}(A)$, $\mathbb{P}(B)$ and $\mathbb{P}(A \cup B)$ are 
all defined, then $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B)$. This is called the finite additivity 
property. 

Let $\mathcal{F}$ be the set of subsets of $\Omega$ on which $\mathbb{P}$ is defined. It would seem silly if 
$A \in \mathcal{F}$ and $A^c \notin \mathcal{F}$, since $\mathbb{P}(A^c)$ could simply be defined by $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$. 
Similarly, if $\mathbb{P}$ is defined on disjoint sets $A$ and $B$, then it makes sense if $A \cup B \in \mathcal{F}$. 
We will also require this closure under unions to hold (i) regardless of whether 
the sets are disjoint and (ii) even for countably infinitely many sets. If $\{A_i\}_i$ 
is a collection of sets and $A_i \in \mathcal{F}$ for all $i \in \mathbb{N}$, then $\bigcup_i A_i \in \mathcal{F}$, and if these 
sets are pairwise disjoint, $\mathbb{P}(\bigcup_i A_i) = \sum_i \mathbb{P}(A_i)$. A set of subsets that satisfies all 
these properties is called a $\sigma$-algebra, which is pronounced ‘sigma-algebra’ and 
sometimes also called a $\sigma$-field (see Note 1). 


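DEFINITION 2.1 ($\sigma$-algebra and probability measures). A set $\mathcal{F} \subset 2^\Omega$ is a $\sigma$-algebra 
if $\Omega \in \mathcal{F}$ and $A^c \in \mathcal{F}$ for all $A \in \mathcal{F}$ and $\bigcup_i A_i \in \mathcal{F}$ for all $\{A_i\}_i$ with 
$A_i \in \mathcal{F}$ for all $i \in \mathbb{N}$. That is, it should include the whole outcome space and 
be closed under complementation and countable unions. A function $\mathbb{P} : \mathcal{F} \to \mathbb{R}$ 
is a probability measure if $\mathbb{P}(\Omega) = 1$ and for all $A \in \mathcal{F}$, $\mathbb{P}(A) \ge 0$ and 
$\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$ and $\mathbb{P}(\bigcup_i A_i) = \sum_i \mathbb{P}(A_i)$ for all countable collections of disjoint 
sets $\{A_i\}_i$ with $A_i \in \mathcal{F}$ for all $i$. If $\mathcal{F}$ is a $\sigma$-algebra and $\mathcal{G} \subset \mathcal{F}$ is also a $\sigma$-algebra, 
then we say $\mathcal{G}$ is a sub-$\sigma$-algebra of $\mathcal{F}$. If $\mathbb{P}$ is a measure defined on $\mathcal{F}$, then 
the restriction of $\mathbb{P}$ to $\mathcal{G}$ is a measure $\mathbb{P}|_{\mathcal{G}}$ on $\mathcal{G}$ defined by $\mathbb{P}|_{\mathcal{G}}(A) = \mathbb{P}(A)$ for 
all $A \in \mathcal{G}$. 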
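
On a finite outcome space the conditions of the definition can be checked mechanically. The following is a minimal sketch under that finiteness assumption; the function name `is_sigma_algebra` and the example families are illustrative only.

```python
from itertools import combinations


def is_sigma_algebra(omega, family):
    """Check the three conditions of the definition on a finite outcome space."""
    omega = frozenset(omega)
    family = {frozenset(a) for a in family}
    if omega not in family:
        return False
    if any(omega - a not in family for a in family):
        return False
    # On a finite space, closure under pairwise unions gives closure under
    # countable unions, because only finitely many distinct sets exist.
    return all(a | b in family for a, b in combinations(family, 2))


omega = {1, 2, 3, 4}
coarse = [set(), {1, 2}, {3, 4}, omega]
broken = [set(), {1}, omega]   # the complement {2, 3, 4} is missing
print(is_sigma_algebra(omega, coarse), is_sigma_algebra(omega, broken))
```

The family `coarse` is a sub-$\sigma$-algebra of the power set of `omega`, while `broken` fails closure under complementation.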


At this stage, the reader may rightly wonder why we introduced the notion 
of sub-$\sigma$-algebras. The answer should become clear quite soon. The elements 
of $\mathcal{F}$ are called measurable sets. They are measurable in the sense that $\mathbb{P}$ 
assigns values to them. The pair $(\Omega, \mathcal{F})$ alone is called a measurable space, 
while the triplet $(\Omega, \mathcal{F}, \mathbb{P})$ is called a probability space. If the condition that 
$\mathbb{P}(\Omega) = 1$ is lifted, then $\mathbb{P}$ is called a measure. If the condition that $\mathbb{P}(A) \ge 0$ 
is also lifted, then $\mathbb{P}$ is called a signed measure. For measures and signed 
measures, it would be unusual to use the symbol $\mathbb{P}$, which is mostly reserved for 
probabilities. Probability measures are also called probability distributions, 
or just distributions. 

Random variables lead to new probability measures. In particular, in the 
example above $\mathbb{P}_X(A) = \mathbb{P}(X^{-1}(A))$ is a probability measure defined for all the 
subsets $A$ of $\mathbb{R}$ for which $\mathbb{P}(X^{-1}(A))$ is defined. More generally, for a random 
variable $X$, the probability measure $\mathbb{P}_X$ is called the law of $X$, or the 
push-forward measure of $\mathbb{P}$ under $X$. 


The significance of the push-forward measure $\mathbb{P}_X$ is that any probabilistic 
question concerning $X$ can be answered from the knowledge of $\mathbb{P}_X$ alone. 
Even $\Omega$ and the details of the map $X$ are not needed. This is often used as 
an excuse to not even mention the underlying probability space $(\Omega, \mathcal{F}, \mathbb{P})$. 


If we keep X fixed but change P (for example, by switching to loaded dice), 
then the measure induced by X changes. We will often use arguments that do 
exactly this, especially when proving lower bounds on the limits of how well 
bandit algorithms can perform. 
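
As a small illustration of keeping the mechanism fixed while changing the measure, the sketch below computes the law of the sum of two dice under a fair and a loaded product measure. The particular loading (a six with probability $1/2$) and the helper names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product


def product_measure(face_probs, n):
    """Probability of each outcome in {1,...,6}^n for n independent dice."""
    measure = {}
    for omega in product(face_probs, repeat=n):
        p = Fraction(1)
        for face in omega:
            p *= face_probs[face]
        measure[omega] = p
    return measure


def push_forward(measure, X):
    """Law of X on a finite space: P_X({v}) = P(X^{-1}({v}))."""
    law = {}
    for omega, p in measure.items():
        law[X(omega)] = law.get(X(omega), Fraction(0)) + p
    return law


fair = {face: Fraction(1, 6) for face in range(1, 7)}
loaded = {face: Fraction(1, 10) for face in range(1, 6)}
loaded[6] = Fraction(1, 2)

X = sum   # the mechanism stays fixed: total of the two dice
law_fair = push_forward(product_measure(fair, 2), X)
law_loaded = push_forward(product_measure(loaded, 2), X)
print(law_fair[12], law_loaded[12])
```

The function `X` is identical in both calls; only the measure on the outcome space changes, and with it the induced law.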

The astute reader would have noticed that we skipped over some details. 
Measures are defined as functions from a $\sigma$-algebra to $\mathbb{R}$, so if we want to call 
$\mathbb{P}_X$ a measure, then its domain $\{A \subset \mathbb{R} : X^{-1}(A) \in \mathcal{F}\}$ had better be a $\sigma$-algebra. 
This holds in great generality. You will show in Exercise 2.3 that for functions 
$X : \Omega \to \mathcal{X}$ with $\mathcal{X}$ arbitrary, the collection $\{A \subset \mathcal{X} : X^{-1}(A) \in \mathcal{F}\}$ is a 
$\sigma$-algebra. 

It will be useful to generalise our example a little by allowing $X$ to take on 
values in sets other than the reals. For example, the range could be vectors 
or abstract objects like sequences. Let $(\Omega, \mathcal{F})$ be a measurable space, $\mathcal{X}$ be an 
arbitrary set and $\mathcal{G} \subset 2^{\mathcal{X}}$. A function $X : \Omega \to \mathcal{X}$ is called an $\mathcal{F}/\mathcal{G}$-measurable 
map if $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{G}$. Note that $\mathcal{G}$ need not be a $\sigma$-algebra. 
When $\mathcal{F}$ and $\mathcal{G}$ are obvious from the context, $X$ is called a measurable map. 
What are the typical choices for $\mathcal{G}$? When $X$ is real-valued, it is usual to let 
$\mathcal{G} = \{(a,b) : a < b \text{ with } a, b \in \mathbb{R}\}$ be the set of all open intervals. The reader can 
verify that if $X$ is $\mathcal{F}/\mathcal{G}$-measurable, then it is also $\mathcal{F}/\sigma(\mathcal{G})$-measurable, where 
$\sigma(\mathcal{G})$ is the smallest $\sigma$-algebra that contains $\mathcal{G}$. This smallest $\sigma$-algebra can be 
shown to exist. Furthermore, it contains exactly those sets $A$ that are in every 
$\sigma$-algebra that contains $\mathcal{G}$ (see Exercise 2.5). When $\mathcal{G}$ is the set of open intervals, 
$\sigma(\mathcal{G})$ is usually denoted by $\mathfrak{B}$ or $\mathfrak{B}(\mathbb{R})$ and is called the Borel $\sigma$-algebra of $\mathbb{R}$. 
This definition is extended to $\mathbb{R}^k$ by replacing open intervals with open rectangles 
of the form $\prod_{i=1}^k (a_i, b_i)$, where $a < b \in \mathbb{R}^k$. If $\mathcal{G}$ is the set of all such open 
rectangles, then $\sigma(\mathcal{G})$ is the Borel $\sigma$-algebra $\mathfrak{B}(\mathbb{R}^k)$. More generally, the Borel 
$\sigma$-algebra of a topological space $\mathcal{X}$ is the $\sigma$-algebra generated by the open sets of 
$\mathcal{X}$. 
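
For intuition about $\sigma(\mathcal{G})$, here is a sketch that builds the smallest $\sigma$-algebra containing a given collection on a finite set by repeatedly closing under complements and unions. This brute-force closure only works on finite spaces and says nothing about the Borel case; `generate_sigma_algebra` is a hypothetical helper name.

```python
def generate_sigma_algebra(omega, generators):
    """Smallest sigma-algebra on a finite set omega containing the generators."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        # add all complements and all pairwise unions of the sets found so far
        closure = {omega - a for a in family} | {a | b for a in family for b in family}
        if closure <= family:
            return family
        family |= closure


sigma = generate_sigma_algebra({1, 2, 3, 4}, [{1}, {2}])
print(sorted(sorted(a) for a in sigma))
```

With generators $\{1\}$ and $\{2\}$ the result cannot separate 3 from 4: the set $\{3,4\}$ belongs to it, but $\{3\}$ does not.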


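DEFINITION 2.2 (Random variables and elements). A random variable 
(random vector) on measurable space $(\Omega, \mathcal{F})$ is a $\mathcal{F}/\mathfrak{B}(\mathbb{R})$-measurable function 
$X : \Omega \to \mathbb{R}$ (respectively $\mathcal{F}/\mathfrak{B}(\mathbb{R}^k)$-measurable function $X : \Omega \to \mathbb{R}^k$). A random 
element between measurable spaces $(\Omega, \mathcal{F})$ and $(\mathcal{X}, \mathcal{G})$ is a $\mathcal{F}/\mathcal{G}$-measurable 
function $X : \Omega \to \mathcal{X}$. 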


Thus, random vectors are random elements where the range space is 
$(\mathbb{R}^k, \mathfrak{B}(\mathbb{R}^k))$, and random vectors are random variables when $k = 1$. Random 
elements generalise random variables and vectors to functions that do not 
take values in $\mathbb{R}^k$. The push-forward measure (or law) can be defined for 
any random element. Furthermore, random variables and vectors work nicely 
together. If $X_1, \dots, X_k$ are $k$ random variables on the same domain $(\Omega, \mathcal{F})$, 
then $X(\omega) = (X_1(\omega), \dots, X_k(\omega))$ is an $\mathbb{R}^k$-valued random vector, and vice versa 
(Exercise 2.2). Multiple random variables $X_1, \dots, X_k$ from the same measurable 
space can thus be viewed as a random vector $X = (X_1, \dots, X_k)$. 

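Given a map $X : \Omega \to \mathcal{X}$ between measurable spaces $(\Omega, \mathcal{F})$ and $(\mathcal{X}, \mathcal{G})$, we let 
$\sigma(X) = \{X^{-1}(A) : A \in \mathcal{G}\}$ be the $\sigma$-algebra generated by $X$. The map $X$ is 
$\mathcal{F}/\mathcal{G}$-measurable if and only if $\sigma(X) \subset \mathcal{F}$. By checking the definitions one can 
show that $\sigma(X)$ is a sub-$\sigma$-algebra of $\mathcal{F}$ and in fact is the smallest sub-$\sigma$-algebra 
for which $X$ is measurable. If $\mathcal{G} = \sigma(\mathcal{A})$ itself is generated by a set system 
$\mathcal{A} \subset 2^{\mathcal{X}}$, then to check the $\mathcal{F}/\mathcal{G}$-measurability of $X$, it suffices to check whether 
$X^{-1}(\mathcal{A}) = \{X^{-1}(A) : A \in \mathcal{A}\}$ is a subset of $\mathcal{F}$. This is sufficient 
because $\sigma(X^{-1}(\mathcal{A})) = X^{-1}(\sigma(\mathcal{A}))$, and by definition the latter is $\sigma(X)$. In fact, 
to check whether a map is measurable, either one uses the composition rule or 
checks $X^{-1}(\mathcal{A}) \subset \mathcal{F}$ for a ‘generator’ $\mathcal{A}$ of $\mathcal{G}$. 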
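
The definition of $\sigma(X)$ as a collection of preimages can be explored on a finite example. The sketch below is only an illustration: it takes the power set of the range as $\mathcal{G}$, and the helper names (`power_set`, `sigma_of_map`, `is_measurable`) are assumptions rather than notation from the text.

```python
from itertools import chain, combinations


def power_set(s):
    """All subsets of a finite collection, as frozensets."""
    s = list(s)
    subsets = chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
    return [frozenset(c) for c in subsets]


def preimage(X, omega, A):
    return frozenset(w for w in omega if X(w) in A)


def sigma_of_map(X, omega, G):
    """sigma(X) = {X^{-1}(A) : A in G}."""
    return {preimage(X, omega, A) for A in G}


def is_measurable(X, omega, F, G):
    """X is F/G-measurable iff every preimage of a set in G lies in F."""
    return sigma_of_map(X, omega, G) <= set(F)


omega = [1, 2, 3, 4]
parity = lambda w: w % 2                 # range {0, 1}
G = power_set({0, 1})
sigma_X = sigma_of_map(parity, omega, G)
F_coarse = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(omega)}
print(sorted(sorted(a) for a in sigma_X))
print(is_measurable(parity, omega, F_coarse, G))          # parity vs. a coarse F
print(is_measurable(parity, omega, power_set(omega), G))  # parity vs. the power set
```

Here $\sigma(X)$ consists of the empty set, the odd outcomes, the even outcomes and the whole space, so `parity` is not measurable with respect to `F_coarse`, which only distinguishes $\{1,2\}$ from $\{3,4\}$.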

Random elements can be combined to produce new random elements by 
composition. One can show that if $f$ is $\mathcal{F}/\mathcal{G}$-measurable and $g$ is $\mathcal{G}/\mathcal{H}$-measurable 
for $\sigma$-algebras $\mathcal{F}$, $\mathcal{G}$ and $\mathcal{H}$ over appropriate spaces, then their composition $g \circ f$ 
is $\mathcal{F}/\mathcal{H}$-measurable (Exercise 2.1). This is used most often for Borel functions, 
which is a special name for $\mathfrak{B}(\mathbb{R}^m)/\mathfrak{B}(\mathbb{R}^n)$-measurable functions from $\mathbb{R}^m$ to 
$\mathbb{R}^n$. These functions are also called Borel measurable. The reader will find it 
pleasing that all familiar functions are Borel. First and foremost, all continuous 
functions are Borel, which includes elementary operations such as addition and 
multiplication. Continuity is far from essential, however. In fact one is hard- 
pressed to construct a function that is not Borel. This means the usual operations 
are ‘safe’ when working with random variables. 


Indicator Functions 
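Given an arbitrary set $\Omega$ and $A \subset \Omega$, the indicator function of $A$ is 
$\mathbb{I}_A : \Omega \to \{0,1\}$ given by 

$$\mathbb{I}_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A; \\ 0, & \text{otherwise.} \end{cases}$$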


Sometimes $A$ has a complicated description, and it becomes convenient to abuse 
notation by writing $\mathbb{I}\{\omega \in A\}$ instead of $\mathbb{I}_A(\omega)$. Similarly, we will often write 
$\mathbb{I}\{\operatorname{predicate}(X, Y, \dots)\}$ to mean the indicator function of the subset of $\Omega$ on 
which the predicate is true. It is easy to check that an indicator function $\mathbb{I}_A$ is a 
random variable on $(\Omega, \mathcal{F})$ if and only if $A$ is measurable: $A \in \mathcal{F}$. 

Why So Complicated? 
You may be wondering why we did not define $\mathbb{P}$ on the power set of $\Omega$, which 
is equivalent to declaring that all sets are measurable. In many cases this is a 
perfectly reasonable thing to do, including the example game where nothing 
prevents us from defining $\mathcal{F} = 2^\Omega$. However, beyond this example, there are two 
justifications not to have $\mathcal{F} = 2^\Omega$, the first technical and the second conceptual. 
The technical reason is highlighted by the following surprising theorem 
according to which there does not exist a uniform probability distribution on 
$\Omega = [0,1]$ if $\mathcal{F}$ is chosen to be the power set of $\Omega$ (a uniform probability distribution 
over $[0,1]$, if it existed, would have the property of assigning its length to every 
interval). In other words, if you want to be able to define the uniform measure, 
then $\mathcal{F}$ cannot be too large. By contrast, the uniform measure can be defined 
over the Borel $\sigma$-algebra, though proving this is not elementary. 


THEOREM 2.3. Let $\Omega = [0,1]$, and $\mathcal{F}$ be the power set of $\Omega$. Then there does not 
exist a measure $\mathbb{P}$ on $(\Omega, \mathcal{F})$ such that $\mathbb{P}([a,b]) = b - a$ for all $0 \le a \le b \le 1$. 


The main conceptual reason not to have $\mathcal{F} = 2^\Omega$ is that we can then 
use $\sigma$-algebras to represent information. This is especially useful in the study 
of bandits where the learner is interacting with an environment and is slowly 
gaining knowledge. One useful way to represent this is by using a sequence of 
nested $\sigma$-algebras, as we explain in the next section. One might also be worried 
that the Borel $\sigma$-algebra does not contain enough measurable sets. Rest assured 
that this is not a problem and you will not easily find a non-measurable set. For 
completeness, an example of a non-measurable set will still be given in the notes, 
along with a little more discussion on this topic. 

A second technical reason to prefer the measure-theoretic approach to 
probabilities is that this approach allows for the unification of distributions 
on discrete spaces and densities on continuous ones (the uninitiated reader will 
find the definitions of these later). This unification can be necessary when dealing 
with random variables that combine elements of both, e.g. a random variable 
that is zero with probability 1/2 and otherwise behaves like a standard Gaussian. 
Random variables like this give rise to so-called “mixed continuous and discrete 
distributions”, which seem to require special treatment in a naive approach 
to probabilities, yet dealing with random variables like these is nothing out of 
the ordinary under the measure-theoretic approach. 
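
The mixed random variable mentioned above (zero with probability $1/2$, otherwise standard Gaussian) is easy to write down through its cumulative distribution function, even though it has neither a density nor a probability mass function. The sketch below is illustrative; `mixed_cdf` and `sample` are hypothetical names.

```python
import math
import random


def mixed_cdf(x):
    """P(Z <= x) for Z that is 0 with probability 1/2 and N(0, 1) otherwise."""
    std_normal_cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    atom_at_zero = 1.0 if x >= 0 else 0.0
    return 0.5 * atom_at_zero + 0.5 * std_normal_cdf


def sample(rng):
    """Draw one realisation of Z using the supplied random.Random instance."""
    return 0.0 if rng.random() < 0.5 else rng.gauss(0.0, 1.0)


rng = random.Random(0)
draws = [sample(rng) for _ in range(1000)]
print(mixed_cdf(0.0), sum(d == 0.0 for d in draws) / len(draws))
```

The distribution function jumps by $1/2$ at zero (the discrete part) and is smooth elsewhere (the continuous part); a single measure on $\mathfrak{B}(\mathbb{R})$ covers both.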


From Laws to Probability Spaces and Random Variables 

A big ‘conspiracy’ in probability theory is that probability spaces are seldom 
mentioned in theorem statements, despite the fact that a measure cannot be 
defined without one. Statements are instead given in terms of random elements 
and constraints on their joint probabilities. For example, suppose that X and Y 
are random variables such that 


$$\mathbb{P}(X \in A, Y \in B) = \frac{|A \cap [6]|}{6} \cdot \frac{|B \cap [2]|}{2} \quad \text{for all } A, B \in \mathfrak{B}(\mathbb{R}), \tag{2.1}$$

which represents the joint distribution for the values of a dice ($X \in [6]$) and coin 
($Y \in [2]$). The formula describes some constraints on the probabilistic interactions 
between the outputs of $X$ and $Y$, but says nothing about their domain. In a way, 
the domain is an unimportant detail. Nevertheless, one must ask whether or not 
an appropriate domain exists at all. More generally, one may ask whether an 
appropriate probability space exists given some constraints on the joint law of a 
collection $X_1, \dots, X_k$ of random variables. For this to make sense, the constraints 
should not contradict each other, which means there is a probability measure 
$\mu$ on $\mathfrak{B}(\mathbb{R}^k)$ such that $\mu$ satisfies the postulated constraints. But then we can 
choose $\Omega = \mathbb{R}^k$, $\mathcal{F} = \mathfrak{B}(\mathbb{R}^k)$, $\mathbb{P} = \mu$ and $X_i : \Omega \to \mathbb{R}$ to be the $i$th coordinate 
map: $X_i(\omega) = \omega_i$. The push-forward of $\mathbb{P}$ under $X = (X_1, \dots, X_k)$ is $\mu$, which by 
definition is compatible with the constraints. 

A more specific question is whether for a particular set of constraints on the 
joint law there exists a measure compatible with the constraints. Very often the 
constraints are specified for elements of the cartesian product of finitely many 
$\sigma$-algebras, as in Eq. (2.1). If $(\Omega_1, \mathcal{F}_1), \dots, (\Omega_n, \mathcal{F}_n)$ are measurable spaces, then 
the cartesian product of $\mathcal{F}_1, \dots, \mathcal{F}_n$ is 

$$\mathcal{F}_1 \times \cdots \times \mathcal{F}_n = \{A_1 \times \cdots \times A_n : A_1 \in \mathcal{F}_1, \dots, A_n \in \mathcal{F}_n\} .$$

Elements of this set are known as measurable rectangles in $\Omega_1 \times \cdots \times \Omega_n$. 


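THEOREM 2.4 (Carathéodory’s extension theorem). Let $(\Omega_1, \mathcal{F}_1), \dots, (\Omega_n, \mathcal{F}_n)$ 
be measurable spaces and $\bar\mu : \mathcal{F}_1 \times \cdots \times \mathcal{F}_n \to [0,1]$ be a function such that 


(a) $\bar\mu(\Omega_1 \times \cdots \times \Omega_n) = 1$; and 
(b) $\bar\mu(\bigcup_{k=1}^\infty A_k) = \sum_{k=1}^\infty \bar\mu(A_k)$ for all sequences of disjoint sets with $A_k \in 
\mathcal{F}_1 \times \cdots \times \mathcal{F}_n$. 


Let $\Omega = \Omega_1 \times \cdots \times \Omega_n$ and $\mathcal{F} = \sigma(\mathcal{F}_1 \times \cdots \times \mathcal{F}_n)$. Then there exists a unique 
probability measure $\mu$ on $(\Omega, \mathcal{F})$ such that $\mu$ agrees with $\bar\mu$ on $\mathcal{F}_1 \times \cdots \times \mathcal{F}_n$. 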


The theorem is applied by letting $\Omega_k = \mathbb{R}$ and $\mathcal{F}_k = \mathfrak{B}(\mathbb{R})$. Then the values of 
a measure on all cartesian products uniquely determine its value everywhere. 


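It is not true that $\mathcal{F}_1 \times \mathcal{F}_2 = \sigma(\mathcal{F}_1 \times \mathcal{F}_2)$. Take, for example, $\mathcal{F}_1 = \mathcal{F}_2 = 2^{\{1,2\}}$. 
Then, $|\mathcal{F}_1 \times \mathcal{F}_2| = 1 + 3 \times 3 = 10$ (because $\emptyset \times X = \emptyset$), while, since 
$\mathcal{F}_1 \times \mathcal{F}_2$ includes the singletons of $2^{\{1,2\} \times \{1,2\}}$, $\sigma(\mathcal{F}_1 \times \mathcal{F}_2) = 2^{\{1,2\} \times \{1,2\}}$. 
Hence, six sets are missing from $\mathcal{F}_1 \times \mathcal{F}_2$. For example, $\{(1,1),(2,2)\} \in 
\sigma(\mathcal{F}_1 \times \mathcal{F}_2) \setminus \mathcal{F}_1 \times \mathcal{F}_2$. 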
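
The counting in this remark can be reproduced by brute force. The sketch below enumerates the measurable rectangles for $\mathcal{F}_1 = \mathcal{F}_2 = 2^{\{1,2\}}$ and compares them with all subsets of $\{1,2\} \times \{1,2\}$; the helper `power_set` is an illustrative name.

```python
from itertools import chain, combinations, product


def power_set(s):
    """All subsets of a finite collection, as frozensets."""
    s = list(s)
    subsets = chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
    return [frozenset(c) for c in subsets]


F1 = F2 = power_set({1, 2})
# measurable rectangles A x B; the empty rectangle arises in several ways
rectangles = {frozenset(product(A, B)) for A in F1 for B in F2}
all_subsets = set(power_set(product({1, 2}, repeat=2)))
missing = all_subsets - rectangles

singletons_ok = all(frozenset({p}) in rectangles for p in product({1, 2}, repeat=2))
print(len(rectangles), len(all_subsets), len(missing), singletons_ok)
```

Because every singleton is a rectangle, the generated $\sigma$-algebra is the full power set, and the sets in `missing` (such as the diagonal $\{(1,1),(2,2)\}$) are exactly those that are measurable without being rectangles.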


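The $\sigma$-algebra $\sigma(\mathcal{F}_1 \times \cdots \times \mathcal{F}_n)$ is called the product $\sigma$-algebra of $(\mathcal{F}_k)_{k \in [n]}$ 
and is also denoted by $\mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n$. The product operation turns out to be 
associative: $(\mathcal{F}_1 \otimes \mathcal{F}_2) \otimes \mathcal{F}_3 = \mathcal{F}_1 \otimes (\mathcal{F}_2 \otimes \mathcal{F}_3)$, which justifies writing $\mathcal{F}_1 \otimes \mathcal{F}_2 \otimes \mathcal{F}_3$. 
As it turns out, things work out well again with Borel $\sigma$-algebras: for $p, q \in \mathbb{N}^+$, 
$\mathfrak{B}(\mathbb{R}^{p+q}) = \mathfrak{B}(\mathbb{R}^p) \otimes \mathfrak{B}(\mathbb{R}^q)$. Needless to say, the same holds when there are more 
than two terms in the product. The $n$-fold product $\sigma$-algebra of $\mathcal{F}$ is denoted by 
$\mathcal{F}^{\otimes n}$. 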


$\sigma$-algebras and Knowledge 


One of the conceptual advantages of measure-theoretic probability is the 
relationship between $\sigma$-algebras and the intuitive idea of ‘knowledge’. Although 
the relationship is useful and intuitive, it is regrettably not quite perfect. Let 
$(\Omega, \mathcal{F})$, $(\mathcal{X}, \mathcal{G})$ and $(\mathcal{Y}, \mathcal{H})$ be measurable spaces and $X : \Omega \to \mathcal{X}$ and $Y : \Omega \to \mathcal{Y}$ 
be random elements. Having observed the value of $X$ (‘knowing $X$’), one might 
wonder what this entails about the value of $Y$. Even more simplistically, under 
what circumstances can the value of $Y$ be determined exactly having observed $X$? 
The situation is illustrated in Fig. 2.3. As it turns out, with some restrictions, the 
answer can be given in terms of the $\sigma$-algebras generated by $X$ and $Y$. Except 
for a technical assumption on $(\mathcal{Y}, \mathcal{H})$, the following result shows that $Y$ is a 
measurable function of $X$ if and only if $Y$ is $\sigma(X)/\mathcal{H}$-measurable. The technical 
assumption mentioned requires $(\mathcal{Y}, \mathcal{H})$ to be a Borel space, which is true of all 
measurable spaces considered in this book, including $(\mathbb{R}^k, \mathfrak{B}(\mathbb{R}^k))$. We leave the 
exact definition of Borel spaces to the next chapter. 


[Diagram for Figure 2.3: $X$ maps $(\Omega, \mathcal{F})$ to $(\mathcal{X}, \mathcal{G})$, $Y$ maps $(\Omega, \mathcal{F})$ to $(\mathcal{Y}, \mathcal{H})$, and $f$ maps $(\mathcal{X}, \mathcal{G})$ to $(\mathcal{Y}, \mathcal{H})$.] 

Figure 2.3 The factorisation problem asks whether there exists a (measurable) function 
$f$ that makes the diagram commute. 


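LEMMA 2.5 (Factorisation lemma). Assume that $(\mathcal{Y}, \mathcal{H})$ is a Borel space. Then $Y$ 
is $\sigma(X)$-measurable ($\sigma(Y) \subset \sigma(X)$) if and only if there exists a $\mathcal{G}/\mathcal{H}$-measurable 
map $f : \mathcal{X} \to \mathcal{Y}$ such that $Y = f \circ X$. 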


In this sense $\sigma(X)$ contains all the information that can be extracted from $X$ 
via measurable functions. This is not the same as saying that $Y$ can be deduced 
from $X$ if and only if $Y$ is $\sigma(X)$-measurable because the set of $\mathcal{X} \to \mathcal{Y}$ maps 
can be much larger than the set of $\mathcal{G}/\mathcal{H}$-measurable functions. When $\mathcal{G}$ is coarse, 
there are not many $\mathcal{G}/\mathcal{H}$-measurable functions with the extreme case occurring 
when $\mathcal{G} = \{\mathcal{X}, \emptyset\}$. In cases like this, the intuition that $\sigma(X)$ captures all there 
is to know about $X$ is not true anymore (Exercise 2.6). The issue is that $\sigma(X)$ 
does not only depend on $X$, but also on the $\sigma$-algebra of $(\mathcal{X}, \mathcal{G})$ and that if $\mathcal{G}$ is 
coarse-grained, then $\sigma(X)$ can also be coarse-grained and not many functions 
will be $\sigma(X)$-measurable. If $X$ is a random variable, then by definition $\mathcal{X} = \mathbb{R}$ 
and $\mathcal{G} = \mathfrak{B}(\mathbb{R})$, which is relatively fine-grained, and the requirement that $f$ 
be measurable is less restrictive. Nevertheless, even in the nicest setting where 
$\Omega = \mathcal{X} = \mathcal{Y} = \mathbb{R}$ and $\mathcal{F} = \mathcal{G} = \mathcal{H} = \mathfrak{B}(\mathbb{R})$, it can still occur that $Y = f \circ X$ for 
some non-measurable $f$. In other words, all the information about $Y$ exists in $X$ 
but cannot be extracted in a measurable way. These problems only occur when 
$X$ maps measurable sets in $\Omega$ to non-measurable sets in $\mathcal{X}$. Fortunately, while 
such random variables exist, they are never encountered in applications, which 
provides the final justification for thinking of $\sigma(X)$ as containing all that there is 
to know about any random variable $X$ that one may ever expect to encounter. 
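
On a finite outcome space, with all subsets of the ranges taken as measurable, the factorisation question reduces to whether $Y$ is constant on the level sets of $X$. The following sketch works under that simplifying assumption, so none of the measurability subtleties discussed above arise; `factor_through` is a hypothetical helper.

```python
def factor_through(X, Y, omega):
    """Return a dict f with Y(w) == f[X(w)] for every w, or None if none exists."""
    f = {}
    for w in omega:
        # setdefault stores Y(w) on first sight of X(w), then returns the stored value
        if f.setdefault(X(w), Y(w)) != Y(w):
            return None   # two outcomes share X(w) but differ in Y(w)
    return f


omega = range(1, 7)                 # one roll of a six-sided dice


def parity(w):
    return w % 2


def is_even(w):
    return int(w % 2 == 0)


def is_large(w):
    return int(w >= 4)


print(factor_through(parity, is_even, omega))
print(factor_through(parity, is_large, omega))
```

Knowing the parity determines whether the roll is even, so a factor $f$ exists in the first case; it does not determine whether the roll is at least four, so no $f$ exists in the second.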


Filtrations 
In the study of bandits and other online settings, information is revealed to the 
learner sequentially. Let $X_1, \dots, X_n$ be a collection of random variables on a 
common measurable space $(\Omega, \mathcal{F})$. We imagine a learner is sequentially observing 
the values of these random variables: first $X_1$, then $X_2$ and so on. The learner 
needs to make a prediction, or act, based on the available observations. Say, a 
prediction or an act must produce a real-valued response. Then, having observed 
$X_{1:t} = (X_1, \dots, X_t)$, the set of maps $f \circ X_{1:t}$ where $f : \mathbb{R}^t \to \mathbb{R}$ is Borel, captures 
all the possible ways the learner can respond. By Lemma 2.5, this set contains 
exactly the $\sigma(X_{1:t})/\mathfrak{B}(\mathbb{R})$-measurable maps. Thus, if we need to reason about 
the set of $\Omega \to \mathbb{R}$ maps available after observing $X_{1:t}$, it suffices to concentrate 
on the $\sigma$-algebra $\mathcal{F}_t = \sigma(X_{1:t})$. Conveniently, $\mathcal{F}_t$ is independent of the space of 
possible responses, and being a subset of $\mathcal{F}$, it also hides details about the range 
space of $X_{1:t}$. It is easy to check that $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_n \subset \mathcal{F}$, which 
means that more and more functions are becoming $\mathcal{F}_t$-measurable as $t$ increases, 
which corresponds to increasing knowledge (note that $\mathcal{F}_0 = \{\emptyset, \Omega\}$, and the set 
of $\mathcal{F}_0$-measurable functions is the set of constant functions on $\Omega$). 
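
The growth of knowledge along $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$ can be visualised with three coin flips. The sketch assumes a finite outcome space, where a $\sigma$-algebra is determined by its atoms (the smallest non-empty sets it contains) and a real-valued function is measurable exactly when it is constant on each atom. The names `atoms` and `is_measurable` are illustrative.

```python
from itertools import product

omega = list(product([0, 1], repeat=3))   # outcomes of three coin flips


def atoms(t):
    """Atoms of F_t = sigma(X_1, ..., X_t): outcomes grouped by their first t flips."""
    groups = {}
    for w in omega:
        groups.setdefault(w[:t], set()).add(w)
    return [frozenset(g) for g in groups.values()]


def is_measurable(Z, t):
    """Z is F_t-measurable iff it is constant on every atom of F_t."""
    return all(len({Z(w) for w in atom}) == 1 for atom in atoms(t))


def first_two_sum(w):
    return w[0] + w[1]


print([len(atoms(t)) for t in range(4)])
print([is_measurable(first_two_sum, t) for t in range(4)])
```

The number of atoms doubles with every observation, and `first_two_sum` becomes measurable from $t = 2$ onwards, which is when the learner has seen enough to compute it.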

Taking these ideas a little further, we will often find it useful to talk about increasing 
sequences of $\sigma$-algebras without constructing them in terms of random variables 
as above. Given a measurable space $(\Omega, \mathcal{F})$, a filtration is a sequence $(\mathcal{F}_t)_{t=0}^n$ of 
sub-$\sigma$-algebras of $\mathcal{F}$ where $\mathcal{F}_t \subseteq \mathcal{F}_{t+1}$ for all $t < n$. We also allow $n = \infty$, and in 
this case we define 

$$\mathcal{F}_\infty = \sigma\left( \bigcup_{t=0}^\infty \mathcal{F}_t \right)$$

to be the smallest $\sigma$-algebra containing the union of all $\mathcal{F}_t$. Filtrations can also 
be defined in continuous time, but we have no need for that here. A sequence 
of random variables $(X_t)_{t=1}^n$ is adapted to filtration $\mathbb{F} = (\mathcal{F}_t)_{t=0}^n$ if $X_t$ is 
$\mathcal{F}_t$-measurable for each $t$. We also say in this case that $(X_t)_t$ is $\mathbb{F}$-adapted. The 
same nomenclature applies if $n$ is infinite. Finally, $(X_t)_t$ is $\mathbb{F}$-predictable if $X_t$ 
is $\mathcal{F}_{t-1}$-measurable for each $t \in [n]$. Intuitively we may think of an $\mathbb{F}$-predictable 
process $X = (X_t)_t$ as one that has the property that $X_t$ can be known (or 
‘predicted’) based on $\mathcal{F}_{t-1}$, while an $\mathbb{F}$-adapted process is one that has the property 
that $X_t$ can be known based on $\mathcal{F}_t$ only. Since $\mathcal{F}_{t-1} \subseteq \mathcal{F}_t$, a predictable process 
is also adapted. A filtered probability space is the tuple $(\Omega, \mathcal{F}, \mathbb{F}, \mathbb{P})$, where 
$(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space and $\mathbb{F} = (\mathcal{F}_t)_t$ is a filtration of $\mathcal{F}$. 
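
To make the definitions of adapted and predictable processes concrete, here is a minimal Python sketch on a finite sample space. It assumes three coin tosses with $X_t(\omega) = \omega_t$, and it represents $\mathcal{F}_t = \sigma(X_{1:t})$ by the partition of $\Omega$ that generates it. The helper names (`partition`, `is_measurable`, `partial_sum`) are illustrative choices, not notation from the text.

```python
from itertools import product

# Sample space: all outcomes of three coin tosses, omega = (w1, w2, w3).
OMEGA = list(product([0, 1], repeat=3))

def partition(t):
    """Blocks of the partition generating F_t = sigma(X_1, ..., X_t)."""
    blocks = {}
    for omega in OMEGA:
        blocks.setdefault(omega[:t], []).append(omega)
    return list(blocks.values())

def is_measurable(f, t):
    """On a finite space, f is F_t-measurable iff it is constant on each block."""
    return all(len({f(omega) for omega in block}) == 1 for block in partition(t))

def partial_sum(t):
    """S_t(omega) = X_1(omega) + ... + X_t(omega)."""
    return lambda omega: sum(omega[:t])

n = 3
# (S_t)_t is adapted: S_t is F_t-measurable for each t.
adapted = all(is_measurable(partial_sum(t), t) for t in range(1, n + 1))
# (S_t)_t is not predictable: S_1 is not F_0-measurable.
predictable = all(is_measurable(partial_sum(t), t - 1) for t in range(1, n + 1))
# The shifted process (S_{t-1})_t is predictable.
shifted_predictable = all(is_measurable(partial_sum(t - 1), t - 1) for t in range(1, n + 1))
# The partitions get finer as t grows, reflecting increasing knowledge.
block_counts = [len(partition(t)) for t in range(n + 1)]
```

The partition for $t = 0$ has a single block, which matches the observation that only constant functions are $\mathcal{F}_0$-measurable.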


Conditional Probabilities 


Conditional probabilities are introduced so that we can talk about how 
probabilities should be updated when one gains some partial knowledge about a 
random outcome. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $A, B \in \mathcal{F}$ be such 
that $\mathbb{P}(B) > 0$. The conditional probability $\mathbb{P}(A \mid B)$ of $A$ given $B$ is defined 
as 

$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \,.$$

We can think about the outcome $\omega \in \Omega$ as the result of throwing a many-sided 
dice. The question asked is the probability that the dice landed so that $\omega \in A$ 
given that it landed with $\omega \in B$. The meaning of the condition $\omega \in B$ is that we 
focus on dice rolls when $\omega \in B$ is true. All dice rolls when $\omega \in B$ does not hold 
are discarded. Intuitively, what should matter in the conditional probability of $A$ 
given $B$ is how large the portion of $A$ is that lies in $B$, and this is indeed what 
the definition means. 


The importance of conditional probabilities is that they define a calculus of 
how probabilities are to be updated in the presence of extra information. 


The probability $\mathbb{P}(A \mid B)$ is also called the a posteriori (‘after the fact’) 
probability of $A$ given $B$. The a priori probability is $\mathbb{P}(A)$. Note that $\mathbb{P}(A \mid B)$ is 
defined for every $A \in \mathcal{F}$ as long as $\mathbb{P}(B) > 0$. In fact, $A \mapsto \mathbb{P}(A \mid B)$ is a probability 
measure over the measurable space $(\Omega, \mathcal{F})$ called the a posteriori probability measure 
given $B$ (see Exercise 2.7). In a way the temporal characteristics attached to 
the words ‘a posteriori’ and ‘a priori’ can be a bit misleading. Probabilities are 
concerned with predictions. They express the degrees of uncertainty one assigns 
to future events. The conditional probability of $A$ given $B$ is a prediction of 
certain properties of the outcome of the random experiment that results in $\omega$ 
given a certain condition. Everything is related to a future hypothetical outcome. 
Once the dice is rolled, $\omega$ gets fixed, and either $\omega \in A, B$ or not. There is no 
uncertainty left: predictions are trivial after an experiment is done. 

Bayes rule states that provided events $A, B \in \mathcal{F}$ both occur with positive 
probability, 

$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A)\, \mathbb{P}(A)}{\mathbb{P}(B)} \,. \tag{2.2}$$

Bayes rule is useful because it allows one to obtain $\mathbb{P}(A \mid B)$ based on information 
about the quantities on the right-hand side. Remarkably, this happens to be 
the case quite often, explaining why this simple formula has quite a status in 
probability and statistics. Exercise 2.8 asks the reader to verify this law. 
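
As a small numerical check of the definition of conditional probability and of Bayes rule, the following Python sketch works with a fair six-sided dice and exact rational arithmetic. The events `A` and `B` and the helper names are hypothetical choices made for illustration only.

```python
from fractions import Fraction

OMEGA = range(1, 7)  # faces of a fair six-sided dice

def prob(event):
    """P(event) under the uniform measure on OMEGA."""
    return Fraction(sum(1 for w in OMEGA if w in event), 6)

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), defined only when P(b) > 0."""
    if prob(b) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / prob(b)

A = {2, 4, 6}   # hypothetical event: the roll is even
B = {4, 5, 6}   # hypothetical event: the roll is at least four

# Left-hand side of Eq. (2.2), computed from the definition.
posterior = cond_prob(A, B)
# Right-hand side of Eq. (2.2).
bayes = cond_prob(B, A) * prob(A) / prob(B)
```

Both quantities equal 2/3 here, compared with the a priori probability of 1/2 for `A`.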


Independence 


Independence is another basic concept of probability that relates to 
knowledge/information. In its simplest form, independence is a relation that 
holds between events on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Two events $A, B \in \mathcal{F}$ are 
independent if 

$$\mathbb{P}(A \cap B) = \mathbb{P}(A)\, \mathbb{P}(B) \,. \tag{2.3}$$


How is this related to knowledge? Assuming that P (B) > 0, dividing both sides 
by P (B) and using the definition of conditional probability, we get that the above 
is equivalent to 


P(A|B) =P(A) . (2.4) 


Of course, we also have that if P(A) > 0, (2.3) is equivalent to P (B | A) = P (B). 
Both of the latter relations express that A and B are independent if the probability 
assigned to A (or B) remains the same regardless of whether it is known that B 
(respectively, A) occurred. 

We hope our readers will find the definition of independence in terms of a ‘lack 
of influence’ to be sensible. The reason not to use Eq. (2.4) as the definition is 
mostly for the sake of convenience. If we started with (2.4), we would need to 
separately discuss the case of P (B) = 0, which would be cumbersome. A second 
reason is that (2.4) suggests an asymmetric relationship, but intuitively we expect 
independence to be symmetric. 

Uncertain outcomes are often generated part by part with no interaction 
between the processes, which naturally leads to an independence structure (think 
of rolling multiple dice with no interactions between the rolls). Once we discover 
some independence structure, calculations with probabilities can be immensely 
simplified. In fact, independence is often used as a way of constructing probability 
measures of interest (cf. Eq. (2.1), Theorem 2.4 and Exercise 2.9). Independence 
can also appear serendipitously in the sense that a probability space may hold 
many more independent events than its construction may suggest (Exercise 2.10). 


You should always carefully judge whether assumptions about independence 
are really justified. This is part of the modelling and hence is not 
mathematical in nature. Instead you have to think about the physical 
process being modelled. 


A collection of events $\mathcal{G} \subseteq \mathcal{F}$ is said to be pairwise independent if any two 
distinct elements of $\mathcal{G}$ are independent of each other. The events in $\mathcal{G}$ are said 
to be mutually independent if for any integer $n > 0$ and distinct elements 
$A_1, \dots, A_n$ of $\mathcal{G}$, $\mathbb{P}(A_1 \cap \cdots \cap A_n) = \prod_{i=1}^n \mathbb{P}(A_i)$. This is a stronger restriction 
than pairwise independence. In the case of mutually independent events, the 
knowledge of joint occurrence of any finitely many events from the collection will 
not change our prediction of whether some other event in the collection happens. 
But this may not be the case when the events are only pairwise independent 
(Exercise 2.10). Two collections of events $\mathcal{G}_1, \mathcal{G}_2$ are said to be independent of 
each other if for any $A \in \mathcal{G}_1$ and $B \in \mathcal{G}_2$ it holds that $A$ and $B$ are independent. 
This definition is often applied to $\sigma$-algebras. 
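
The gap between pairwise and mutual independence can be seen on a four-point sample space. The sketch below uses one standard example, assumed here for illustration: two fair coin flips and the three events ‘first flip is heads’, ‘second flip is heads’ and ‘both flips agree’.

```python
from fractions import Fraction
from itertools import combinations, product

OMEGA = list(product("HT", repeat=2))  # two fair coin flips, uniform measure

def prob(event):
    """P(event) for an event given as a subset of OMEGA."""
    return Fraction(len(event), len(OMEGA))

A = {w for w in OMEGA if w[0] == "H"}    # first flip is heads
B = {w for w in OMEGA if w[1] == "H"}    # second flip is heads
C = {w for w in OMEGA if w[0] == w[1]}   # both flips agree

# Every pair of distinct events satisfies the product rule (2.3).
pairwise = all(prob(E & F) == prob(E) * prob(F)
               for E, F in combinations([A, B, C], 2))
# The product rule fails for the triple, so the events are not mutually independent.
mutual = pairwise and prob(A & B & C) == prob(A) * prob(B) * prob(C)
```

Knowing that both `A` and `B` occurred determines `C` completely, even though neither event alone changes its probability.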

When the $\sigma$-algebras are induced by random variables, this leads to the 
definition of independence between random variables. Two random 
variables $X$ and $Y$ are independent if $\sigma(X)$ and $\sigma(Y)$ are independent of each 
other. The notions of pairwise and mutual independence can also be naturally 
extended to apply to collections of random variables. All these concepts can be 
and are in fact extended to random elements. 

The default meaning of independence when multiple events or random variables 
are involved is mutual independence. 


When we say that X1,..., Xn are independent random variables, we mean 
that they are mutually independent. Independence is always relative to 
some probability measure, even when a probability measure is not explicitly 
mentioned. In such cases the identity of the probability measure should be 
clear from the context. 


Integration and Expectation 


A key quantity in probability theory is the expectation of a random variable. Fix 
a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and random variable $X : \Omega \to \mathbb{R}$. The expectation of $X$ 
is often denoted by $\mathbb{E}[X]$. This notation unfortunately obscures the dependence 
on the measure $\mathbb{P}$. When the underlying measure is not obvious from context, we 
write $\mathbb{E}_{\mathbb{P}}$ to indicate the expectation with respect to $\mathbb{P}$. Mathematically, we define 
the expected value of $X$ as its Lebesgue integral with respect to $\mathbb{P}$: 

$$\mathbb{E}[X] = \int_\Omega X(\omega) \, d\mathbb{P}(\omega) \,.$$

The right-hand side is also often abbreviated to $\int X \, d\mathbb{P}$. The integral on the 
right-hand side is constructed to satisfy the following two key properties: 


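(a) The integral of indicators is the probability of the underlying event. If $X(\omega) = 
\mathbb{I}\{\omega \in A\}$ is an indicator function for some $A \in \mathcal{F}$, then $\int X \, d\mathbb{P} = \mathbb{P}(A)$. 

(b) Integrals are linear. For all random variables $X_1, X_2$ and reals $\alpha_1, \alpha_2$ such 
that $\int X_1 \, d\mathbb{P}$ and $\int X_2 \, d\mathbb{P}$ are defined, $\int (\alpha_1 X_1 + \alpha_2 X_2) \, d\mathbb{P}$ is defined and 
satisfies 

$$\int_\Omega (\alpha_1 X_1 + \alpha_2 X_2) \, d\mathbb{P} = \alpha_1 \int_\Omega X_1 \, d\mathbb{P} + \alpha_2 \int_\Omega X_2 \, d\mathbb{P} \,. \tag{2.5}$$

These two properties together tell us that whenever $X(\omega) = \sum_{i=1}^n \alpha_i \mathbb{I}\{\omega \in A_i\}$ 
for some $n$, $\alpha_i \in \mathbb{R}$ and $A_i \in \mathcal{F}$, $i = 1, \dots, n$, then 

$$\int_\Omega X \, d\mathbb{P} = \sum_i \alpha_i \mathbb{P}(A_i) \,. \tag{2.6}$$

Functions of the form $X$ are called simple functions. 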

In defining the Lebesgue integral of some random variable $X$, we use (2.6) as 
the definition of the integral when $X$ is a simple function. The next step is to 
extend the definition to non-negative random variables. Let $X : \Omega \to [0, \infty)$ be 
measurable. The idea is to approximate $X$ from below using simple functions 
and take the largest value that can be obtained this way: 

$$\int_\Omega X \, d\mathbb{P} = \sup\left\{ \int_\Omega h \, d\mathbb{P} : h \text{ is simple and } 0 \le h \le X \right\} . \tag{2.7}$$

The meaning of $U \le V$ for random variables $U, V$ is that $U(\omega) \le V(\omega)$ for all 
$\omega \in \Omega$. The supremum on the right-hand side could be infinite, in which case we 
say the integral of $X$ is not defined. Whenever the integral of $X$ is defined, we 
say that $X$ is integrable or, if the identity of the measure $\mathbb{P}$ is unclear, that $X$ 
is integrable with respect to $\mathbb{P}$. Note that since we are taking the supremum of 
non-negative values, $\int_\Omega X \, d\mathbb{P} \ge 0$. 
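
The approximation from below in (2.7) can be watched numerically. The sketch below assumes a specific example not taken from the text: $\Omega = [0, 1]$ with the uniform (Lebesgue) measure and $X(\omega) = \omega^2$, whose integral is $1/3$. It uses the simple functions $h_n = \lfloor 2^n X \rfloor / 2^n \le X$ and evaluates their integrals with (2.6).

```python
import math

def lower_simple_integral(n):
    """Integral of the simple function h_n = floor(2^n X) / 2^n for X(w) = w^2.

    h_n takes the value k / 2^n on the event {k / 2^n <= X < (k + 1) / 2^n},
    which under the uniform measure on [0, 1] has probability
    sqrt((k + 1) / 2^n) - sqrt(k / 2^n).
    """
    m = 2 ** n
    total = 0.0
    for k in range(m):
        level = k / m
        mass = math.sqrt((k + 1) / m) - math.sqrt(k / m)
        total += level * mass   # one term of the sum in Eq. (2.6)
    return total

# The integrals increase with n and approach 1/3 from below.
approximations = [lower_simple_integral(n) for n in range(1, 11)]
```

Since $0 \le X - h_n < 2^{-n}$ pointwise, the gap between the integral of $h_n$ and $1/3$ is below $2^{-n}$.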

Integrals for arbitrary random variables are defined by decomposing the 
random variable into positive and negative parts. Let $X : \Omega \to \mathbb{R}$ be any 
measurable function. Then define $X^+(\omega) = X(\omega)\mathbb{I}\{X(\omega) > 0\}$ and $X^-(\omega) = 
-X(\omega)\mathbb{I}\{X(\omega) < 0\}$ so that $X(\omega) = X^+(\omega) - X^-(\omega)$. Now $X^+$ and $X^-$ are 
both non-negative random variables called the positive and negative parts of 
$X$. Provided that both $X^+$ and $X^-$ are integrable, we define 

$$\int_\Omega X \, d\mathbb{P} = \int_\Omega X^+ \, d\mathbb{P} - \int_\Omega X^- \, d\mathbb{P}$$

and we say that $X$ is integrable. Note that a random variable $X$ is integrable if and 
only if the non-negative-valued random variable $|X|$ is integrable (Exercise 2.13). 


None of what we have done depends on $\mathbb{P}$ being a probability measure. The 
definitions hold for any measure, though for signed measures it is necessary to 
split $\Omega$ into disjoint measurable sets on which the measure is positive/negative, 
an operation that is possible by the Hahn decomposition theorem. We 
will never need signed measures in this book, however. 


A particularly interesting case is when $\Omega = \mathbb{R}$ is the real line, $\mathcal{F} = \mathfrak{B}(\mathbb{R})$ is 
the Borel $\sigma$-algebra and the measure is the Lebesgue measure $\lambda$, which is the 
unique measure on $\mathfrak{B}(\mathbb{R})$ such that $\lambda((a, b)) = b - a$ for any $a < b$. In this scenario, 
if $f : \mathbb{R} \to \mathbb{R}$ is a Borel-measurable function, then we can write the Lebesgue 
integral of $f$ with respect to the Lebesgue measure as 

$$\int_{\mathbb{R}} f \, d\lambda \,.$$

Perhaps unsurprisingly, this almost always coincides with the improper Riemann 
integral of $f$, which is normally written as $\int_{-\infty}^{\infty} f(x) \, dx$. Precisely, if $|f|$ is both 
Lebesgue integrable and Riemann integrable, then the integrals are equal. 


There exist functions that are Riemann integrable and not Lebesgue 
integrable, and also the other way around (although examples of the former 
are more exotic than the latter). 


The Lebesgue measure and its relation to Riemann integration is mentioned 
because when it comes to actually calculating the value of an expectation or 
integral, this is often reduced to calculating integrals over the real line with 
respect to the Lebesgue measure. The calculation is then performed by evaluating 
the Riemann integral, thereby circumventing the need to rederive the integral 
of many elementary functions. Integrals (and thus expectations) have a number 
of important properties. By far the most important is their linearity, which was 
postulated above as the second property in (2.5). To practice using the notation 
with expectations, we restate the first half of this property. In fact, the statement 
is slightly more general than what we demanded for integrals above. 


PROPOSITION 2.6. Let $(X_i)_i$ be a (possibly infinite) sequence of random variables 
on the same probability space and assume that $\mathbb{E}[X_i]$ exists for all $i$ and 
furthermore that $X = \sum_i X_i$ and $\mathbb{E}[\sum_i |X_i|]$ also exist. Then 

$$\mathbb{E}[X] = \sum_i \mathbb{E}[X_i] \,.$$


This exchange of expectations and summation is the source of much magic 
in probability theory because it holds even if $X_i$ are not independent. This 
means that (unlike probabilities) we can very often decouple the expectations of 
dependent random variables, which often proves extremely useful (a collection 
of random variables is dependent if they are not independent). You will prove 
Proposition 2.6 in Exercise 2.15. The other requirement for linearity is that if 
$c \in \mathbb{R}$ is a constant, then $\mathbb{E}[cX] = c\,\mathbb{E}[X]$ (Exercise 2.16). 
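
The following Python sketch checks linearity of expectation for two dependent random variables. The setting is an illustrative assumption: a fair dice with $X(\omega) = \omega$ and $Y(\omega) = \omega^2$, so that $Y$ is completely determined by $X$.

```python
from fractions import Fraction

OMEGA = range(1, 7)  # fair six-sided dice

def expectation(f):
    """E[f] under the uniform measure on OMEGA, computed exactly."""
    return sum(Fraction(f(w), 6) for w in OMEGA)

X = lambda w: w
Y = lambda w: w * w   # a function of X, hence dependent on it

# E[X + Y] computed directly on the sample space.
lhs = expectation(lambda w: X(w) + Y(w))
# E[X] + E[Y] computed separately; linearity says the two agree.
rhs = expectation(X) + expectation(Y)
```

No independence is used anywhere in the computation, which is the point of the proposition.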

Another important statement is concerned with independent random variables. 


PROPOSITION 2.7. If $X$ and $Y$ are independent and either $\mathbb{E}[|X|], \mathbb{E}[|Y|] < \infty$ 
or $\mathbb{E}[|XY|] < \infty$, then $\mathbb{E}[XY] = \mathbb{E}[X]\, \mathbb{E}[Y]$. 
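
A quick exact check of the product rule, and of its failure without independence, can be made with two fair dice. The sample space and the names `first` and `second` are assumptions made for this sketch.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))  # two fair dice, uniform measure

def expectation(f):
    """E[f] under the uniform measure on OMEGA, computed exactly."""
    return sum(Fraction(f(w), len(OMEGA)) for w in OMEGA)

first = lambda w: w[0]    # value of the first dice
second = lambda w: w[1]   # value of the second dice, independent of the first

# E[XY] - E[X]E[Y] for independent X and Y.
independent_gap = (expectation(lambda w: first(w) * second(w))
                   - expectation(first) * expectation(second))
# The same quantity with Y replaced by X itself, which is not independent of X.
dependent_gap = (expectation(lambda w: first(w) * first(w))
                 - expectation(first) ** 2)
```

The first gap is zero, while the second equals the variance of a single dice roll, which is positive.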


In general $\mathbb{E}[XY] \ne \mathbb{E}[X]\, \mathbb{E}[Y]$ (Exercise 2.19). Finally, an important simple 
result connects expectations of non-negative random variables to their tail 
probabilities. 


PROPOSITION 2.8. If $X \ge 0$ is a non-negative random variable, then 

$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}(X > x) \, dx \,.$$
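
For an integer-valued random variable the integral in Proposition 2.8 reduces to a finite sum, which makes the identity easy to verify by hand or in code. The sketch below assumes $X$ is the value of a fair six-sided dice.

```python
from fractions import Fraction

FACES = range(1, 7)

def tail(x):
    """P(X > x) for a fair six-sided dice X."""
    return Fraction(sum(1 for face in FACES if face > x), 6)

# P(X > x) is constant on each interval [k, k + 1) and zero for x >= 6, so the
# integral over [0, infinity) is a sum of six unit-width rectangles.
tail_integral = sum(tail(k) for k in range(0, 6))
# The expectation computed directly from the definition.
mean = sum(Fraction(face, 6) for face in FACES)
```

Both values equal 7/2.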


The integrand in Proposition 2.8 is called the tail probability function 
$x \mapsto \mathbb{P}(X > x)$ of $X$. This is also known as the complementary cumulative 
distribution function of $X$. The cumulative distribution function (CDF) of 
$X$ is defined as $x \mapsto \mathbb{P}(X \le x)$ and is usually denoted by $F_X$. These functions 
are defined for all random variables, not just non-negative ones. One can check 
that $F_X : \mathbb{R} \to [0, 1]$ is increasing, right continuous and $\lim_{x \to -\infty} F_X(x) = 0$ and 
$\lim_{x \to \infty} F_X(x) = 1$. The CDF of a random variable captures every aspect of the 
probability measure $\mathbb{P}_X$ induced by $X$, while still being just a function on the real 
line, a property that makes it a little more human friendly than $\mathbb{P}_X$. One can also 
generalise CDFs to random vectors: if $X$ is an $\mathbb{R}^k$-valued random vector, then its 
CDF is defined as the $F_X : \mathbb{R}^k \to [0, 1]$ function that satisfies $F_X(x) = \mathbb{P}(X \le x)$, 
where, in line with our conventions, $X \le x$ means that all components of $X$ are 
less than or equal to the respective component of $x$. The pushforward $\mathbb{P}_X$ of a 
random element is an alternative way to summarise the distribution of $X$. In 
particular, for any real-valued, $f : \mathcal{X} \to \mathbb{R}$ measurable function, 

$$\mathbb{E}[f(X)] = \int_{\mathcal{X}} f(x) \, d\mathbb{P}_X(x)$$

provided that either the right-hand side, or the left-hand side exist. This is known 
as the “law of the unconscious statistician”, or LOTUS, because it is so frequently 
used. 
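
On a finite space the law of the unconscious statistician says that summing over outcomes $\omega$ and summing over values $x$ weighted by the pushforward give the same number. The sketch below assumes a fair dice, the random variable $X(\omega) = \omega \bmod 3$ and the function $f(x) = x^2$, all chosen purely for illustration.

```python
from collections import defaultdict
from fractions import Fraction

OMEGA = range(1, 7)
P = Fraction(1, 6)            # mass of each outcome of a fair dice
X = lambda w: w % 3           # a random variable with values in {0, 1, 2}
f = lambda x: x * x

# Pushforward measure P_X on the range of X: P_X({x}) = P(X = x).
pushforward = defaultdict(Fraction)
for w in OMEGA:
    pushforward[X(w)] += P

# E[f(X)] as an integral over the sample space.
lhs = sum(f(X(w)) * P for w in OMEGA)
# The same expectation as an integral of f against the pushforward.
rhs = sum(f(x) * mass for x, mass in pushforward.items())
```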


Conditional Expectation 


Conditional expectation allows us to talk about the expectation of a random 
variable given the value of another random variable, or more generally, given 
some $\sigma$-algebra. 


EXAMPLE 2.9. Let $(\Omega, \mathcal{F}, \mathbb{P})$ model the outcomes of an unloaded dice: $\Omega = [6]$, 
$\mathcal{F} = 2^\Omega$ and $\mathbb{P}(A) = |A|/6$. Define two random variables $X$ and $Y$ by 
$Y(\omega) = \mathbb{I}\{\omega > 3\}$ and $X(\omega) = \omega$. Suppose we are interested in the expectation 
of $X$ given a specific value of $Y$. Arguing intuitively, we might notice that $Y = 1$ 
means that the unobserved $X$ must be either 4, 5 or 6, and that each of these 
outcomes is equally likely, and so the expectation of $X$ given $Y = 1$ should 
be $(4 + 5 + 6)/3 = 5$. Similarly, the expectation of $X$ given $Y = 0$ should be 
$(1 + 2 + 3)/3 = 2$. If we want a concise summary, we can just write that ‘the 
expectation of $X$ given $Y$’ is $5Y + 2(1 - Y)$. Notice how this is a random variable 
itself. 

The notation for this conditional expectation is $\mathbb{E}[X \mid Y]$. Using this notation, 
in Example 2.9 we can concisely write $\mathbb{E}[X \mid Y] = 5Y + 2(1 - Y)$. A little more 
generally, if $X : \Omega \to \mathcal{X}$ and $Y : \Omega \to \mathcal{Y}$ with $\mathcal{X}, \mathcal{Y} \subset \mathbb{R}$ and $|\mathcal{X}|, |\mathcal{Y}| < \infty$, then 
$\mathbb{E}[X \mid Y] : \Omega \to \mathbb{R}$ is the random variable given by $\mathbb{E}[X \mid Y](\omega) = \mathbb{E}[X \mid Y = Y(\omega)]$, 
where 

$$\mathbb{E}[X \mid Y = y] = \sum_{x \in \mathcal{X}} x\, \mathbb{P}(X = x \mid Y = y) = \sum_{x \in \mathcal{X}} \frac{x\, \mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)} \,. \tag{2.8}$$

This is undefined when $\mathbb{P}(Y = y) = 0$ so that $\mathbb{E}[X \mid Y](\omega)$ is undefined on the 
measure zero set $\{\omega : \mathbb{P}(Y = Y(\omega)) = 0\}$. 

Eq. (2.8) does not generalise to continuous random variables because $\mathbb{P}(Y = y)$ 
in the denominator might be zero for all $y$. For example, let $Y$ be a random 
variable taking values on $[0, 1]$ according to a uniform distribution and $X \in \{0, 1\}$ 
be Bernoulli with bias $Y$. This means that the joint measure on $X$ and $Y$ is 
$\mathbb{P}(X = 1, Y \in [p, q]) = \int_p^q x \, dx$ for $0 \le p < q \le 1$. Intuitively it seems like $\mathbb{E}[X \mid Y]$ 
should be equal to $Y$, but how to define it? The mean of a Bernoulli random 
variable is equal to its bias so the definition of conditional probability shows that 
for $0 \le p < q \le 1$, 

$$\begin{aligned}
\mathbb{E}[X \mid Y \in [p, q]] &= \mathbb{P}(X = 1 \mid Y \in [p, q]) \\
&= \frac{\mathbb{P}(X = 1, Y \in [p, q])}{\mathbb{P}(Y \in [p, q])} \\
&= \frac{q^2 - p^2}{2(q - p)} \\
&= \frac{p + q}{2} \,.
\end{aligned}$$

This calculation is not well defined when $p = q$ because $\mathbb{P}(Y \in [p, p]) = 0$. 
Nevertheless, letting $q = p + \varepsilon$ for $\varepsilon > 0$ and taking the limit as $\varepsilon$ tends to zero 
seems like a reasonable way to argue that $\mathbb{P}(X = 1 \mid Y = p) = p$. Unfortunately 
this approach does not generalise to abstract spaces because there is no canonical 
way of taking limits towards a set of measure zero, and different choices lead to 
different answers. 
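
The limiting argument for the uniform/Bernoulli example can be traced with exact arithmetic. The sketch below evaluates $\mathbb{P}(X = 1 \mid Y \in [p, p + \varepsilon])$ from the closed-form expressions derived above for a shrinking sequence of $\varepsilon$; the value $p = 3/10$ and the function name are arbitrary illustrative choices.

```python
from fractions import Fraction

def cond_prob_x1(p, q):
    """P(X = 1 | Y in [p, q]) for uniform Y on [0, 1] and X Bernoulli with bias Y.

    Requires 0 <= p < q <= 1 so that the conditioning event has positive probability.
    """
    joint = (q * q - p * p) / 2   # integral of x over [p, q]
    marginal = q - p              # P(Y in [p, q])
    return joint / marginal

p = Fraction(3, 10)
# Shrink the interval [p, p + eps] with eps = 10^-k for k = 1, ..., 5.
values = [cond_prob_x1(p, p + Fraction(1, 10 ** k)) for k in range(1, 6)]
# Distance from the conjectured limit p; it equals eps / 2 and tends to zero.
gaps = [v - p for v in values]
```

The sketch only illustrates this one way of shrinking the conditioning set. As the text notes, other ways of approaching a measure-zero set can give different limits.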

Instead we use Eq. (2.8) as the starting point for an abstract definition of 
conditional expectation as a random variable satisfying two requirements. First, 
from Eq. (2.8) we see that $\mathbb{E}[X \mid Y](\omega)$ should only depend on $Y(\omega)$ and so 
should be measurable with respect to $\sigma(Y)$. The second requirement is called the 
‘averaging property’. For measurable $A \subseteq \mathcal{Y}$, Eq. (2.8) shows that 

$$\begin{aligned}
\mathbb{E}\left[\mathbb{I}_{Y^{-1}(A)}\, \mathbb{E}[X \mid Y]\right] &= \sum_{y \in A} \mathbb{P}(Y = y)\, \mathbb{E}[X \mid Y = y] \\
&= \sum_{y \in A} \sum_{x \in \mathcal{X}} x\, \mathbb{P}(X = x, Y = y) \\
&= \mathbb{E}\left[\mathbb{I}_{Y^{-1}(A)} X\right] .
\end{aligned}$$

This can be viewed as putting a set of linear constraints on $\mathbb{E}[X \mid Y]$ with one 
constraint for each measurable $A \subseteq \mathcal{Y}$. By treating $\mathbb{E}[X \mid Y]$ as an unknown 
$\sigma(Y)$-measurable random variable, we can attempt to solve this linear system. As 
it turns out, this can always be done: the linear constraints and the measurability 
restriction on $\mathbb{E}[X \mid Y]$ completely determine $\mathbb{E}[X \mid Y]$ except for a set of measure 
zero. Notice that both conditions only depend on $\sigma(Y) \subseteq \mathcal{F}$. The abstract 
definition of conditional expectation takes these properties as the definition and 
replaces the role of $Y$ with a sub-$\sigma$-algebra. 


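DEFINITION 2.10 (Conditional expectation). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space 
and $X : \Omega \to \mathbb{R}$ be a random variable and $\mathcal{H}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. The 
conditional expectation of $X$ given $\mathcal{H}$ is denoted by $\mathbb{E}[X \mid \mathcal{H}]$ and defined to be 
any $\mathcal{H}$-measurable random variable on $\Omega$ such that for all $H \in \mathcal{H}$, 

$$\int_H \mathbb{E}[X \mid \mathcal{H}] \, d\mathbb{P} = \int_H X \, d\mathbb{P} \,. \tag{2.9}$$

Given a random variable $Y$, the conditional expectation of $X$ given $Y$ is 
$\mathbb{E}[X \mid Y] = \mathbb{E}[X \mid \sigma(Y)]$. 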


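THEOREM 2.11. Given any probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a sub-$\sigma$-algebra $\mathcal{H}$ of $\mathcal{F}$ 
and a $\mathbb{P}$-integrable random variable $X : \Omega \to \mathbb{R}$, there exists an $\mathcal{H}$-measurable 
function $f : \Omega \to \mathbb{R}$ that satisfies (2.9). Further, any two $\mathcal{H}$-measurable functions 
$f_1, f_2 : \Omega \to \mathbb{R}$ that satisfy (2.9) are equal with probability one: $\mathbb{P}(f_1 = f_2) = 1$. 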


When random variables $X$ and $Y$ agree with $\mathbb{P}$-probability one, we say they 
are $\mathbb{P}$-almost surely equal, which is often abbreviated to ‘$X = Y$ $\mathbb{P}$-a.s.’, or 
‘$X = Y$ a.s.’ when the measure is clear from context. A related useful notion is 
the concept of null sets: $U \in \mathcal{F}$ is a null set of $\mathbb{P}$, or a $\mathbb{P}$-null set if $\mathbb{P}(U) = 0$. 
Thus, $X = Y$ $\mathbb{P}$-a.s. if and only if $X$ and $Y$ agree except on a $\mathbb{P}$-null set. 


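The reader may find it odd that $\mathbb{E}[X \mid Y]$ is a random variable on $\Omega$ rather 
than the range of $Y$. Lemma 2.5 and the fact that $\mathbb{E}[X \mid \sigma(Y)]$ is $\sigma(Y)$-measurable 
shows there exists a measurable function $f : (\mathbb{R}, \mathfrak{B}(\mathbb{R})) \to 
(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ such that $\mathbb{E}[X \mid \sigma(Y)](\omega) = (f \circ Y)(\omega)$ (see Fig. 2.4). In this sense 
$\mathbb{E}[X \mid Y](\omega)$ only depends on $Y(\omega)$, and occasionally we write $\mathbb{E}[X \mid Y](y)$. 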


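Returning to Example 2.9, we see that $\mathbb{E}[X \mid Y] = \mathbb{E}[X \mid \sigma(Y)]$ and $\sigma(Y) = 
\{\{1, 2, 3\}, \{4, 5, 6\}, \emptyset, \Omega\}$. Denote this set-system by $\mathcal{H}$ for brevity. The condition 
that $\mathbb{E}[X \mid \mathcal{H}]$ is $\mathcal{H}$-measurable can only be satisfied if $\mathbb{E}[X \mid \mathcal{H}](\omega)$ is constant on 
$\{1, 2, 3\}$ and $\{4, 5, 6\}$. Then (2.9) immediately implies that 

$$\mathbb{E}[X \mid \mathcal{H}](\omega) = \begin{cases} 2, & \text{if } \omega \in \{1, 2, 3\}; \\ 5, & \text{if } \omega \in \{4, 5, 6\}. \end{cases}$$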
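
The dice example can be reproduced in a few lines of Python, including a direct check of the averaging property (2.9) over every set in $\sigma(Y)$. This is a sketch for finite sample spaces only; the helper names `integral` and `cond_exp` are hypothetical.

```python
from fractions import Fraction
from itertools import chain, combinations

OMEGA = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in OMEGA}   # unloaded dice
X = lambda w: w
Y = lambda w: int(w > 3)

def integral(f, event):
    """Integral of f over `event` with respect to P."""
    return sum(f(w) * P[w] for w in event)

def cond_exp(f, g):
    """E[f | g] as a function on OMEGA, following Eq. (2.8)."""
    def value(w):
        # Level set {g = g(w)}; it has positive probability in this example.
        level_set = [v for v in OMEGA if g(v) == g(w)]
        return integral(f, level_set) / integral(lambda v: 1, level_set)
    return value

E_X_given_Y = cond_exp(X, Y)

# sigma(Y) consists of all unions of the level sets of Y.
blocks = [[w for w in OMEGA if Y(w) == y] for y in sorted({Y(w) for w in OMEGA})]
sigma_Y = [list(chain.from_iterable(c))
           for r in range(len(blocks) + 1)
           for c in combinations(blocks, r)]

# Averaging property (2.9): the integrals agree on every H in sigma(Y).
averaging_holds = all(integral(E_X_given_Y, H) == integral(X, H) for H in sigma_Y)
values = [E_X_given_Y(w) for w in OMEGA]
```

The list `values` is constant on $\{1, 2, 3\}$ and on $\{4, 5, 6\}$, which is exactly the measurability requirement in this example.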


While the definition of conditional expectations given above is non-constructive 
and $\mathbb{E}[X \mid \mathcal{H}]$ is uniquely defined only up to events of $\mathbb{P}$-measure zero, none of 
this should be of significant concern. First, we will rarely need closed-form 
expressions for conditional expectations, but rather need to know how they relate to 
other expectations, conditional or not. This is also the reason why it should not 
be concerning that they are only determined up to zero probability events: usually, 
conditional expectations appear in other expectations or in statements that are 
concerned with how probable some event is, making the difference between the 
different ‘versions’ of conditional expectations disappear. 

Figure 2.4 Factorisation of conditional expectation. When there is no confusion, we 
occasionally write $\mathbb{E}[X \mid Y](y)$ in place of $f(y)$. 

We close the section by summarising some additional important properties of 
conditional expectations. These follow from the definition directly, and the reader 
is invited to prove them in Exercise 2.21. 


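THEOREM 2.12. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $\mathcal{G}, \mathcal{G}_1, \mathcal{G}_2 \subseteq \mathcal{F}$ be sub-$\sigma$-algebras 
of $\mathcal{F}$ and $X, Y$ integrable random variables on $(\Omega, \mathcal{F}, \mathbb{P})$. The following 
hold true: 

1 If $X \ge 0$, then $\mathbb{E}[X \mid \mathcal{G}] \ge 0$ almost surely. 

2 $\mathbb{E}[1 \mid \mathcal{G}] = 1$ almost surely. 

3 $\mathbb{E}[X + Y \mid \mathcal{G}] = \mathbb{E}[X \mid \mathcal{G}] + \mathbb{E}[Y \mid \mathcal{G}]$ almost surely. 

4 $\mathbb{E}[XY \mid \mathcal{G}] = Y\, \mathbb{E}[X \mid \mathcal{G}]$ almost surely if $\mathbb{E}[XY]$ exists and $Y$ is $\mathcal{G}$-measurable. 

5 If $\mathcal{G}_1 \subseteq \mathcal{G}_2$, then $\mathbb{E}[X \mid \mathcal{G}_1] = \mathbb{E}[\mathbb{E}[X \mid \mathcal{G}_2] \mid \mathcal{G}_1]$ almost surely. 

6 If $\sigma(X)$ is independent of $\mathcal{G}_2$ given $\mathcal{G}_1$, then $\mathbb{E}[X \mid \sigma(\mathcal{G}_1 \cup \mathcal{G}_2)] = \mathbb{E}[X \mid \mathcal{G}_1]$ 
almost surely. 

7 If $\mathcal{G} = \{\emptyset, \Omega\}$ is the trivial $\sigma$-algebra, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$ almost surely. 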


Properties 1 and 2 are self-explanatory. Property 3 generalises the linearity of 
expectation. Property 4 shows that a measurable quantity can be pulled outside 
of a conditional expectation and corresponds to the property that for constants 
$c$, $\mathbb{E}[cX] = c\,\mathbb{E}[X]$. Property 5 is called the tower rule or the law of total 
expectations. It says that the fineness of $\mathbb{E}[X \mid \mathcal{G}_2]$ is obliterated when taking the 
conditional expectation with respect to $\mathcal{G}_1$. Property 6 relates independence and 
conditional expectations, and it says that conditioning on independent quantities 
does not give further information on expectations. Here, the two event systems $\mathcal{A}$ 
and $\mathcal{B}$ are said to be conditionally independent of each other given a $\sigma$-algebra 
$\mathcal{F}$ if for all $A \in \mathcal{A}$ and $B \in \mathcal{B}$, $\mathbb{P}(A \cap B \mid \mathcal{F}) = \mathbb{P}(A \mid \mathcal{F})\, \mathbb{P}(B \mid \mathcal{F})$ holds almost 
surely. We also often say that $\mathcal{A}$ is conditionally independent of $\mathcal{B}$ given $\mathcal{F}$, but 
of course, this relation is symmetric. This property is often applied with random 
variables: $X$ is said to be conditionally independent of $Y$ given $Z$, if $\sigma(X)$ is 
conditionally independent of $\sigma(Y)$ given $\sigma(Z)$. In this case, $\mathbb{E}[X \mid Y, Z] = \mathbb{E}[X \mid Z]$ 
holds almost surely. Property 7 states that conditioning on no information gives 
the same expectation as not conditioning at all. 
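
The tower rule (Property 5) and Property 7 can be checked exactly on a finite space, where a sub-$\sigma$-algebra corresponds to a partition and the conditional expectation is the block-wise average. The partitions `G1` and `G2` and the choice $X(\omega) = \omega^2$ below are assumptions made for this sketch.

```python
from fractions import Fraction

OMEGA = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in OMEGA}

def cond_exp(f, partition):
    """E[f | sigma(partition)]: average f over the block containing omega."""
    table = {}
    for block in partition:
        mass = sum(P[w] for w in block)
        mean = sum(f(w) * P[w] for w in block) / mass
        for w in block:
            table[w] = mean
    return lambda w: table[w]

X = lambda w: w * w
G1 = [[1, 2, 3], [4, 5, 6]]          # coarse partition
G2 = [[1], [2, 3], [4, 5], [6]]      # refines G1, so sigma(G1) is contained in sigma(G2)

# Property 5: conditioning on G2 first and then on G1 equals conditioning on G1.
direct = cond_exp(X, G1)
towered = cond_exp(cond_exp(X, G2), G1)
tower_rule_holds = all(direct(w) == towered(w) for w in OMEGA)

# Property 7: conditioning on the trivial sigma-algebra gives E[X].
trivial = cond_exp(X, [OMEGA])
unconditional = sum(X(w) * P[w] for w in OMEGA)
```

Swapping the roles of `G1` and `G2` in `towered` also returns `direct`, since a function that is already measurable with respect to the coarser partition is left unchanged by conditioning on the finer one.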


The above list of abstract properties will be used over and over again. We 
encourage you to study the list carefully and convince yourself that 
all items are intuitive. Playing around with discrete random variables can 
be invaluable for this. Eventually it will all become second nature. 


Notes 


1 The Greek letter $\sigma$ is often used by mathematicians in association with 
countable infinities. Hence the term $\sigma$-algebra (and $\sigma$-field). Note that countable 
additivity is often called $\sigma$-additivity. The requirement that additivity should 
hold for systems of countably infinitely many sets is made so that probabilities 
of (interesting) limiting events are guaranteed to exist. 

Measure theory is concerned with measurable spaces, measures and with 
their properties. An obvious distinction between probability theory and measure 
theory is that in probability theory, one is (mostly) concerned with probability 
measures. But the distinction does not stop here. In probability theory, the 
emphasis is on the probability measures and their relations to each other. The 
measurable spaces are there in the background, but are viewed as part of the 
technical toolkit rather than the topic of main interest. Also, in probability 
theory, independence is often at the center of attention, while independence is 
not a property measure-theorists care much about. 

In our toy example, instead of $\Omega = [6]^7$, we could have chosen $\Omega = [6]^8$ 
(considering rolling eight dice instead of seven, one dice never used). There are 
many other possibilities. We can consider coin flips instead of dice rolls (think 
about how this could be done). To make this easy, we could use weighted coins 
(for example, a coin that lands on heads with probability 1/6), but we don’t 
actually need weighted coins (this may be a little tricky to see). The main 
point is that there are many ways to emulate one randomisation device by 
using another. The difference between these is the set $\Omega$. What makes a choice 
of $\Omega$ viable is if we can emulate the game mechanism on the top of $\Omega$ so that 
in the end the probability of seeing any particular value remains the same. But 
the main point is that the choice of $\Omega$ is far from unique. The same is true for 
the way we calculate the value of the game! For example, the dice could be 
reordered, if we stay with the first construction. This was noted already, but it 
cannot be repeated frequently enough: the biggest conspiracy in all probability 
theory is that we first make a big fuss about introducing $\Omega$, and then it turns 
out that the actual construction of $\Omega$ does not matter. 

All Riemann-integrable functions on a bounded domain are Lebesgue integrable. 
Difficulties only arise when taking improper integrals. A standard example 
is $\int_0^\infty \frac{\sin(x)}{x} \, dx$, which is an improper Riemann integrable function, but is 
not Lebesgue integrable because $\int_{(0, \infty)} |\sin(x)/x| \, dx = \infty$. The situation is 
analogous to the difference between conditionally and absolutely convergent 
series, with the Lebesgue integral only defined in the latter case. 

Can you think of a set that is not Borel measurable? Such sets exist, but do not 
arise naturally in applications. The classic example is the Vitali set, which is 
formed by taking the quotient group $G = \mathbb{R}/\mathbb{Q}$ and then applying the axiom 
of choice to choose a representative in $[0, 1]$ from each equivalence class in $G$. 
Non-measurable functions are so unusual that you do not have to worry much 
about whether or not functions $X : \mathbb{R} \to \mathbb{R}$ are measurable. With only a few 
exceptions, questions of measurability arising in this book are not related to 
the fine details of the Borel $\sigma$-algebra. Much more frequently they are related 
to filtrations and the notion of knowledge available having observed certain 
random elements. 

There is a lot to say about why the sum, or the product of random variables 
are also random variables. Or why $\inf_n X_n$, $\sup_n X_n$, $\liminf_n X_n$, $\limsup_n X_n$ 
are measurable when $X_n$ are. The key point is to show that the composition of 
measurable maps is a measurable map and that continuous maps are measurable 
and then apply these results (Exercise 2.1). For $\limsup_n X_n$, just rewrite it as 
$\lim_{m \to \infty} \sup_{n \ge m} X_n$; note that $\sup_{n \ge m} X_n$ is decreasing (we take suprema of 
smaller sets as $m$ increases), hence $\limsup_n X_n = \inf_m \sup_{n \ge m} X_n$, reducing 
the question to studying $\inf_n X_n$ and $\sup_n X_n$. Finally, for $\inf_n X_n$ note that 
it suffices if $\{\omega : \inf_n X_n \ge t\}$ is measurable for any $t$ real. Now, $\inf_n X_n \ge t$ 
if and only if $X_n \ge t$ for all $n$. Hence, $\{\omega : \inf_n X_n \ge t\} = \bigcap_n \{\omega : X_n \ge t\}$, 
which is a countable intersection of measurable sets, hence measurable (this 
latter follows by the elementary identity $(\bigcap_i A_i)^c = \bigcup_i A_i^c$). 

The factorisation lemma, Lemma 2.5, is attributed to Joseph Doob and Eugene 
Dynkin. The lemma sneakily uses the properties of real numbers (think about 
why), which is another reason why what we said about $\sigma$-algebras containing 
all information is not entirely true. The lemma has extensions to more general 
random elements [Taraldsen, 2018, for example]. The key requirement in a 
way is that the $\sigma$-algebra associated with the range space of $Y$ should be rich 
enough. 

We did not talk about basic results like Lebesgue’s dominated/monotone 
convergence theorems, Fatou’s lemma or Jensen’s inequality. We will definitely 
use the last of these, which is explained in a dedicated chapter on convexity 
(Chapter 26). The other results can be found in the texts we cite. They are 
concerned with infinite sequences of random variables and conditions under 
which their limits can be interchanged with Lebesgue integrals. In this book 
we rarely encounter problems related to such sequences and hope you forgive 
us on the few occasions they are necessary (the reason is simply because we 
mostly focus on finite time results or take expectations before taking limits 
when dealing with asymptotics). 

You might be surprised that we have not mentioned densities. For most of 
us, our first exposure to probability on continuous spaces was by studying the 
normal distribution and its density 

$$p(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \,, \tag{2.10}$$

which can be integrated over intervals to obtain the probability that a Gaussian 
random variable will take a value in that interval. The reader should notice 
that $p : \mathbb{R} \to \mathbb{R}$ is Borel measurable and that the Gaussian measure associated 
with this density is $\mathbb{P}$ on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ defined by 

$$\mathbb{P}(A) = \int_A p \, d\lambda \,.$$

Here the integral is with respect to the Lebesgue measure $\lambda$ on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$. The 
notion of a density can be generalised beyond this simple setup. Let $P$ and $Q$ 
be measures (not necessarily probability measures) on arbitrary measurable 
space $(\Omega, \mathcal{F})$. The Radon-Nikodym derivative of $P$ with respect to $Q$ is 
an $\mathcal{F}$-measurable random variable $\frac{dP}{dQ} : \Omega \to [0, \infty)$ such that 

$$P(A) = \int_A \frac{dP}{dQ} \, dQ \quad \text{for all } A \in \mathcal{F} . \tag{2.11}$$

We can also write this in the form $\int \mathbb{I}_A \, dP = \int \mathbb{I}_A \frac{dP}{dQ} \, dQ$, $A \in \mathcal{F}$, from which we 
may realise that for any $P$-integrable random variable $X$, $\int X \, dP = \int X \frac{dP}{dQ} \, dQ$ 
must also hold. This is often called the change-of-measure formula. Another 
word for the Radon-Nikodym derivative $\frac{dP}{dQ}$ is the density of $P$ with respect to 
$Q$. It is not hard to find examples where the density does not exist. We say that 
$P$ is absolutely continuous with respect to $Q$ if $Q(A) = 0 \implies P(A) = 0$ 
for all $A \in \mathcal{F}$. When $\frac{dP}{dQ}$ exists, it follows immediately that $P$ is absolutely 
continuous with respect to $Q$ by Eq. (2.11). Except for some pathological cases, 
it turns out that this is both necessary and sufficient for the existence of $dP/dQ$. 
The measure $Q$ is $\sigma$-finite if there exists a countable covering $\{A_i\}$ of $\Omega$ with 
$\mathcal{F}$-measurable sets such that $Q(A_i) < \infty$ for each $i$. 


THEOREM 2.13. Let $P, Q$ be measures on a common measurable space $(\Omega, \mathcal{F})$ 
and assume that $Q$ is $\sigma$-finite. Then the density of $P$ with respect to $Q$, $\frac{dP}{dQ}$, 
exists if and only if $P$ is absolutely continuous with respect to $Q$. Furthermore, 
$\frac{dP}{dQ}$ is uniquely defined up to a $Q$-null set so that for any $f_1, f_2$ satisfying (2.11), 
$f_1 = f_2$ holds $Q$-almost surely. 


Densities work as expected. Suppose that $Z$ is a standard Gaussian random 
variable. We usually write its density as in Eq. (2.10), which we now know 
is the Radon-Nikodym derivative of the Gaussian measure with respect to 
the Lebesgue measure. The densities of ‘classical’ continuous distributions are 
almost always defined with respect to the Lebesgue measure. 


In line with the literature, we will use $P \ll Q$ to denote that $P$ is absolutely 
continuous with respect to $Q$. When $P$ is absolutely continuous with respect 
to $Q$, we also say that $Q$ dominates $P$. 


A useful result for Radon-Nikodym derivatives is the chain rule, which states 
that if $P \ll Q \ll S$, then $\frac{dP}{dQ} \frac{dQ}{dS} = \frac{dP}{dS}$. The proof of this result follows from our 
earlier observation that $\int f \, dQ = \int f \frac{dQ}{dS} \, dS$ for any $Q$-integrable $f$. Indeed, the 
chain rule is obtained from this by taking $f = \mathbb{I}_A \frac{dP}{dQ}$ with $A \in \mathcal{F}$ and noting 
that this is indeed $Q$-integrable and $\int \mathbb{I}_A \frac{dP}{dQ} \, dQ = \int \mathbb{I}_A \, dP$. The chain rule is 
often used to reduce the calculation of densities to calculation with known 
densities. 


The Radon-Nikodym derivative unifies the notions of distribution (for discrete
spaces) and density (for continuous spaces). Let $\Omega$ be discrete (finite or
countable) and let $\rho$ be the counting measure on $(\Omega, 2^\Omega)$, which is defined
by $\rho(A) = |A|$. For any $P$ on $(\Omega, \mathcal{F})$, it is easy to see that $P \ll \rho$ and
$\frac{dP}{d\rho}(i) = P(\{i\})$, which is sometimes called the distribution function of $P$.
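
On a finite outcome space the density, the chain rule and the counting-measure remark all reduce to elementary arithmetic, which makes them easy to check by hand or by machine. The following is a minimal sketch only: the outcome space, the three measures `P`, `Q`, `S` and the helper names `density` and `integrate` are assumptions made for this illustration and do not appear in the text.

```python
from fractions import Fraction as Fr

# Hypothetical finite outcome space and three measures with P << Q << S.
omega = ["a", "b", "c", "d"]
S = {w: Fr(1) for w in omega}                                   # counting measure
Q = {"a": Fr(1, 2), "b": Fr(1, 4), "c": Fr(1, 4), "d": Fr(0)}   # Q({d}) = 0
P = {"a": Fr(1, 4), "b": Fr(3, 4), "c": Fr(0), "d": Fr(0)}      # P({d}) = 0 as well


def density(p, q):
    """Return one version of dp/dq, assuming p << q (q(w) = 0 implies p(w) = 0)."""
    assert all(p[w] == 0 for w in omega if q[w] == 0), "p is not dominated by q"
    # On q-null points the density is arbitrary; this sketch picks 0.
    return {w: (p[w] / q[w] if q[w] != 0 else Fr(0)) for w in omega}


def integrate(f, q, event):
    """Integral of f over `event` with respect to the measure q."""
    return sum(f[w] * q[w] for w in event)


dP_dQ = density(P, Q)
dQ_dS = density(Q, S)
dP_dS = density(P, S)

# Defining property: P(A) is the integral of dP/dQ over A with respect to Q.
A = ["a", "c", "d"]
assert integrate(dP_dQ, Q, A) == sum(P[w] for w in A)

# Chain rule: dP/dQ * dQ/dS = dP/dS (here at every point; in general S-almost everywhere).
assert all(dP_dQ[w] * dQ_dS[w] == dP_dS[w] for w in omega)

# With S the counting measure, dP/dS is just the probability mass function of P.
assert all(dP_dS[w] == P[w] for w in omega)
```

Exact rational arithmetic is used so that the identities hold with equality rather than up to rounding error.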


The Radon-Nikodym derivative provides another way to define the conditional
expectation. Let $X$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$ and $\mathcal{H} \subseteq \mathcal{F}$
be a sub-$\sigma$-algebra and $P|_{\mathcal{H}}$ be the restriction of $P$ to $(\Omega, \mathcal{H})$. Define measure
$\mu$ on $(\Omega, \mathcal{H})$ by $\mu(A) = \int_A X \, dP|_{\mathcal{H}}$. It is easy to check that $\mu \ll P|_{\mathcal{H}}$ and
that $\mathbb{E}[X \mid \mathcal{H}] = \frac{d\mu}{dP|_{\mathcal{H}}}$ satisfies Eq. (2.9). We note that the proof of the
Radon-Nikodym theorem is nontrivial and that the existence of conditional
expectations is more easily guaranteed via an ‘elementary’ but abstract
argument using functional analysis.


The Fubini-Tonelli theorem, which we will also refer to as Fubini’s theorem,
is a powerful result that allows one to exchange the order of integrations. This
result is needed, for example, for proving Proposition 2.8 (Exercise 2.20). To state
it, we need to introduce product measures. These work as expected: given two
probability spaces, $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, the product measure $P$ of $P_1$
and $P_2$ is defined as any measure on $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2)$ that satisfies
$P(A_1 \times A_2) = P_1(A_1) P_2(A_2)$ for all $(A_1, A_2) \in \mathcal{F}_1 \times \mathcal{F}_2$ (recall that
$\mathcal{F}_1 \otimes \mathcal{F}_2 = \sigma(\mathcal{F}_1 \times \mathcal{F}_2)$ is the product $\sigma$-algebra of $\mathcal{F}_1$ and $\mathcal{F}_2$). Theorem 2.4
implies that this product measure, which is often denoted by $P_1 \times P_2$ (or $P_1 \otimes P_2$),
is uniquely defined. (Think about what this product measure has to do with
independence.) The Fubini-Tonelli theorem (often just ‘Fubini’) states the following:
let $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$ be two probability spaces and consider a random
variable $X$ on the product probability space
$(\Omega, \mathcal{F}, P) = (\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2, P_1 \times P_2)$.
If any of the three integrals $\int |X(\omega)| \, dP(\omega)$,
$\int \left( \int |X(\omega_1, \omega_2)| \, dP_1(\omega_1) \right) dP_2(\omega_2)$,
$\int \left( \int |X(\omega_1, \omega_2)| \, dP_2(\omega_2) \right) dP_1(\omega_1)$ is finite, then

$$\int X(\omega) \, dP(\omega) = \int \left( \int X(\omega_1, \omega_2) \, dP_1(\omega_1) \right) dP_2(\omega_2) = \int \left( \int X(\omega_1, \omega_2) \, dP_2(\omega_2) \right) dP_1(\omega_1).$$
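
For finite spaces every integral is a finite sum and the integrability condition holds automatically, so the exchange of integrals can be verified directly. The sketch below is an illustration under assumptions of our own: the two small probability spaces, the random variable `X` and all variable names are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

# Hypothetical finite probability spaces (Omega_1, P1) and (Omega_2, P2).
P1 = {0: Fr(1, 3), 1: Fr(2, 3)}
P2 = {"x": Fr(1, 2), "y": Fr(1, 4), "z": Fr(1, 4)}

# Product measure, determined by its values on rectangles of singletons.
P = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(P1, P2)}

SCORE = {"x": -1, "y": 2, "z": 5}


def X(w1, w2):
    """An arbitrary (illustrative) random variable on the product space."""
    return (w1 + 1) * SCORE[w2] + w1


# Integral with respect to the product measure.
joint = sum(X(w1, w2) * p for (w1, w2), p in P.items())

# Iterated integrals, in both orders.
p1_inside = sum(sum(X(w1, w2) * P1[w1] for w1 in P1) * P2[w2] for w2 in P2)
p2_inside = sum(sum(X(w1, w2) * P2[w2] for w2 in P2) * P1[w1] for w1 in P1)

assert joint == p1_inside == p2_inside
```

The interesting content of the theorem is of course in the infinite case, where the finiteness condition on one of the three integrals cannot be dropped.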


For topological space $X$, the support of a measure $\mu$ on $(X, \mathcal{B}(X))$ is

$$\operatorname{Supp}(\mu) = \{x \in X : \mu(U) > 0 \text{ for all neighborhoods } U \text{ of } x\}.$$

When $X$ is discrete, this reduces to $\operatorname{Supp}(\mu) = \{x : \mu(\{x\}) > 0\}$.

Let $X$ be a topological space. The weak* topology on the space of probability
measures $\mathcal{P}(X)$ on $(X, \mathcal{B}(X))$ is the coarsest topology such that $\mu \mapsto \int f \, d\mu$
is continuous for all bounded continuous functions $f : X \to \mathbb{R}$. In particular,
a sequence of probability measures $(\mu_n)_{n=1}^\infty$ converges to $\mu$ in this topology
if and only if $\lim_{n \to \infty} \int f \, d\mu_n = \int f \, d\mu$ for all bounded continuous functions
$f : X \to \mathbb{R}$.
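
As a small numerical illustration of this mode of convergence, take $X = [0, 1]$ and let $\mu_n$ be the uniform measure on the grid $\{1/n, 2/n, \ldots, 1\}$. The integrals $\int f \, d\mu_n$ are then Riemann sums, which approach the Lebesgue integral of $f$ for continuous $f$. The sketch below is hypothetical throughout: the choice of $\mu_n$, the two test functions and the names `integral_mu_n` and `errors` are assumptions made for this example.

```python
import math


def integral_mu_n(f, n):
    """Integral of f with respect to mu_n, the uniform measure on {1/n, 2/n, ..., 1}."""
    return sum(f(k / n) for k in range(1, n + 1)) / n


# Two bounded continuous functions on [0, 1] with their exact Lebesgue integrals.
test_functions = [
    (lambda x: x * x, 1.0 / 3.0),
    (math.cos, math.sin(1.0)),
]

# Absolute error of the integral under mu_n, for a few values of n.
errors = {
    n: [abs(integral_mu_n(f, n) - exact) for f, exact in test_functions]
    for n in (10, 100, 1000)
}
```

Note that each $\mu_n$ assigns probability one to the rationals while the Lebesgue measure assigns them probability zero, so convergence in this topology does not imply $\mu_n(A) \to \mu(A)$ for every measurable $A$.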


THEOREM 2.14. When X is compact and Hausdorff and P(X) is the space of 
regular probability measures on (X,B(X)) with the weak* topology, then P(X) 
is compact. 


Mathematical terminology can be a bit confusing sometimes. Since E maps 
(certain) functions to real values, it is also called the expectation operator. 
‘Operator’ is just a fancy name for a function. In operator theory, the study 
of operators, the focus is on operators whose domain is infinite dimensional, 
hence the distinct name. However, most results of operator theory do not 
hinge upon this property. If the image space is the set of reals, we talk about 
functionals. The properties of functionals are studied in yet another subfield of 
mathematics, functional analysis. The expectation operator is a functional 
that maps the set of $P$-integrable functions (often denoted by $L^1(\Omega, P)$ or
$L^1(P)$) to reals. Its most important property is linearity, which was stated as
a requirement for integrals that define the expectation operator (Eq. (2.5)). 
In line with the previous comment, when we use E, more often than not, the 
probability space remains hidden. As such, the symbol E is further abused. 


Bibliographic Remarks 


Much of this chapter draws inspiration from David Pollard’s A user’s guide to 
measure theoretic probability [Pollard, 2002]. We like this book because the author 
takes a rigorous approach, but still explains the ‘why’ and ‘how’ with great
care. The book gets quite advanced quite fast, concentrating on the big picture 
rather than getting lost in the details. Other useful references include the book by 
Billingsley [2008], which has many good exercises and is quite comprehensive in 
terms of its coverage of the ‘basics’. These books are both quite detailed. For an
outstanding shorter introduction to measure-theoretic probability, see the book
by Williams [1991], which has an enthusiastic style and a pleasant bias towards 
martingales. We also like the book by Kallenberg [2002], which is recommended 
for the mathematically inclined readers who already have a good understanding of 
the basics. The author has put a major effort into organising the material so that 
redundancy is minimised and generality is maximised. This reorganisation resulted 
in quite a few original proofs, and the book is comprehensive. The factorisation 
lemma (Lemma 2.5) is stated in the book by Kallenberg [2002] (Lemma 1.13 
there). Kallenberg calls this lemma the ‘functional representation’ lemma and 
attributes it to Joseph Doob. Theorem 2.4 is a corollary of Carathéodory’s 
extension theorem, which says that probability measures defined on semi-rings of 
sets have a unique extension to the generated $\sigma$-algebra. The remaining results can
be found in any of the three books mentioned above. Theorem 2.14 appears as
theorem 8.9.3 in the two-volume book by Bogachev [2007]. Finally, for something
older and less technical, we recommend the philosophical essays on probability
by Pierre Laplace, which were recently reprinted [Laplace, 2012].


Exercises 


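2.1 (COMPOSING RANDOM ELEMENTS) Show that if $f$ is $\mathcal{F}/\mathcal{G}$-measurable and $g$
is $\mathcal{G}/\mathcal{H}$-measurable for sigma algebras $\mathcal{F}$, $\mathcal{G}$ and $\mathcal{H}$ over appropriate spaces, then
their composition, $g \circ f$ (defined the usual way: $(g \circ f)(\omega) = g(f(\omega))$, $\omega \in \Omega$), is
$\mathcal{F}/\mathcal{H}$-measurable.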


2.2 Let $X_1, \ldots, X_n$ be random variables on $(\Omega, \mathcal{F})$. Prove that $(X_1, \ldots, X_n)$ is
a random vector.


2.3 (RANDOM VARIABLE INDUCED $\sigma$-ALGEBRA) Let $U$ be an arbitrary set and
$(V, \Sigma)$ a measurable space and $X : U \to V$ an arbitrary function. Show that
$\Sigma_X = \{X^{-1}(A) : A \in \Sigma\}$ is a $\sigma$-algebra over $U$.


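2.4 Let $(\Omega, \mathcal{F})$ be a measurable space and $A \subseteq \Omega$ and $\mathcal{F}|_A = \{A \cap B : B \in \mathcal{F}\}$.


(a) Show that $(A, \mathcal{F}|_A)$ is a measurable space.
(b) Show that if $A \in \mathcal{F}$, then $\mathcal{F}|_A = \{B : B \in \mathcal{F}, B \subseteq A\}$.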


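2.5 Let $\mathcal{G} \subseteq 2^\Omega$ be a non-empty collection of sets and define $\sigma(\mathcal{G})$ as the smallest
$\sigma$-algebra that contains $\mathcal{G}$. By ‘smallest’ we mean that $\mathcal{F} \subseteq 2^\Omega$ is smaller than
$\mathcal{F}' \subseteq 2^\Omega$ if $\mathcal{F} \subseteq \mathcal{F}'$.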


(a) Show that $\sigma(\mathcal{G})$ exists and contains exactly those sets $A$ that are in every
$\sigma$-algebra that contains $\mathcal{G}$.

(b) Suppose $(\Omega', \mathcal{F})$ is a measurable space and let $X : \Omega' \to \Omega$ be $\mathcal{F}/\mathcal{G}$-measurable.
Show that $X$ is also $\mathcal{F}/\sigma(\mathcal{G})$-measurable. (We often use this result to simplify
the job of checking whether a random variable satisfies some measurability
property.)

(c) Prove that if $A \in \mathcal{F}$ where $\mathcal{F}$ is a $\sigma$-algebra, then $\mathbb{I}\{A\}$ is $\mathcal{F}$-measurable.


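2.6 (KNOWLEDGE AND $\sigma$-ALGEBRAS: A PATHOLOGICAL EXAMPLE) In the context
of Lemma 2.5, show an example where $Y = X$ and yet $Y$ is not $\sigma(X)$-measurable.


HINT As suggested after the lemma, this can be arranged by choosing
$\Omega = \mathcal{Y} = \mathcal{X} = \mathbb{R}$, $X(\omega) = Y(\omega) = \omega$, $\mathcal{F} = \mathcal{H} = \mathcal{B}(\mathbb{R})$ and $\mathcal{G} = \{\emptyset, \mathbb{R}\}$ to
be the trivial $\sigma$-algebra.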


2.7 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $B \in \mathcal{F}$ be such that $\mathbb{P}(B) > 0$. Prove
that $A \mapsto \mathbb{P}(A \mid B)$ is a probability measure over $(\Omega, \mathcal{F})$.


2.8 (BAYES LAW) Verify (2.2). 


2.9 Consider the standard probability space $(\Omega, \mathcal{F}, \mathbb{P})$ generated by two standard,
unbiased, six-sided dice that are thrown independently of each other. Thus,
$\Omega = \{1, \ldots, 6\}^2$, $\mathcal{F} = 2^\Omega$ and $\mathbb{P}(A) = |A|/6^2$ for any $A \in \mathcal{F}$ so that $X_i(\omega) = \omega_i$
represents the outcome of throwing dice $i \in \{1, 2\}$.


(a) Show that the events ‘$X_1 < 2$’ and ‘$X_2$ is even’ are independent of each
other.

(b) More generally, show that any two events $A \in \sigma(X_1)$ and $B \in \sigma(X_2)$
are independent of each other.


2.10 (SERENDIPITOUS INDEPENDENCE) The point of this exercise is to understand 
independence more deeply. Solve the following problems: 


(a) Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Show that $\emptyset$ and $\Omega$ (which are events)
are independent of any other event. What is the intuitive meaning of this?

(b) Continuing the previous part, show that any event $A \in \mathcal{F}$ with $\mathbb{P}(A) \in \{0, 1\}$
is independent of any other event.

(c) What can we conclude about an event $A \in \mathcal{F}$ that is independent of its
complement, $A^c = \Omega \setminus A$? Does your conclusion make intuitive sense?

(d) What can we conclude about an event $A \in \mathcal{F}$ that is independent of itself?
Does your conclusion make intuitive sense?

(e) Consider the probability space generated by two independent flips of unbiased
coins with the smallest possible $\sigma$-algebra. Enumerate all pairs of events
$A, B$ such that $A$ and $B$ are independent of each other.

(f) Consider the probability space generated by the independent rolls of two
unbiased three-sided dice. Call the possible outcomes of the individual dice
rolls 1, 2 and 3. Let $X_i$ be the random variable that corresponds to the
outcome of the $i$th dice roll ($i \in \{1, 2\}$). Show that the events $\{X_1 \le 2\}$ and
$\{X_1 = X_2\}$ are independent of each other.

(g) The probability space of the previous example is an example when the
probability measure is uniform on a finite outcome space (which happens to
have a product structure). Now consider any $n$-element, finite outcome space
with the uniform measure. Show that $A$ and $B$ are independent of each other
if and only if the cardinalities $|A|$, $|B|$, $|A \cap B|$ satisfy $n |A \cap B| = |A| \cdot |B|$.

(h) Continuing with the previous problem, show that if $n$ is prime, then no
non-trivial events are independent (an event $A$ is trivial if $\mathbb{P}(A) \in \{0, 1\}$).

(i) Construct an example showing that pairwise independence does not imply
mutual independence.

(j) Is it true or not that $A, B, C$ are mutually independent if and only if
$\mathbb{P}(A \cap B \cap C) = \mathbb{P}(A) \mathbb{P}(B) \mathbb{P}(C)$? Prove your claim.


2.11 (INDEPENDENCE AND RANDOM ELEMENTS) Solve the following problems: 


(a) Let $X$ be a constant random element (that is, $X(\omega) = x$ for any $\omega \in \Omega$ over
the outcome space over which $X$ is defined). Show that $X$ is independent of
any other random variable.

(b) Show that the above continues to hold if $X$ is almost surely constant (that
is, $\mathbb{P}(X = x) = 1$ for an appropriate value $x$).

(c) Show that two events are independent if and only if their indicator random
variables are independent (that is, $A, B$ are independent if and only if
$X(\omega) = \mathbb{I}\{\omega \in A\}$ and $Y(\omega) = \mathbb{I}\{\omega \in B\}$ are independent of each other).

(d) Generalise the result of the previous item to pairwise and mutual 
independence for collections of events and their indicator random variables. 


2.12 If X < Y and X > 0 then E [X] < E [Y]. Further, the statement continues 
to hold even when X is allowed to take on both positive and negative values and 
if both X and Y are integrable. 


2.13 Our goal in this exercise is to show that a random variable X is integrable 
if and only if |X| is integrable. This is broken down into multiple steps. The first 
issue is to deal with the measurability of |X|. While a direct calculation can also 
show this, it may be worthwhile to follow a more general path: 


(a) Show that any continuous function $f : \mathbb{R} \to \mathbb{R}$ is Borel measurable.

(b) Conclude that for any random variable X, |X| is also a random variable. 

(c) Prove that for any random variable X, X is integrable if and only if |X| 
is integrable. (The statement makes sense since |X| is a random variable 
whenever X is). 


HINT For (b) recall Exercise 2.1. For (c) examine the relationship between
$|X|$ and $(X)^+$ and $(X)^-$.


2.14 (INFINITE-VALUED INTEGRALS) Can we consistently extend the definition 
of integrals so that for non-negative random variables, the integral is always 
defined (it may be infinite)? Defend your view by either constructing an example 
(if you are arguing against) or by proving that your definition is consistent with 
the requirements we have for integrals.


2.15 Prove Proposition 2.6.


HINT You may find it useful to use Lebesgue’s dominated/monotone convergence
theorems.


2.16 Prove that if $c \in \mathbb{R}$ is a constant, then $\mathbb{E}[cX] = c \, \mathbb{E}[X]$ (as long as $X$ is
integrable).


2.17 Prove Proposition 2.7. 


HINT Follow the ‘inductive’ definition of Lebesgue integrals, starting with simple 
functions, then non-negative functions and finally arbitrary independent random 
variables. To finish you may want to use Lebesgue’s dominated convergence 
theorem. 


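2.18 Suppose that $\mathcal{G}_1 \subseteq \mathcal{G}_2$ and prove that $\mathbb{E}[X \mid \mathcal{G}_1] = \mathbb{E}[\mathbb{E}[X \mid \mathcal{G}_1] \mid \mathcal{G}_2]$ almost
surely.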


2.19 Demonstrate using an example that in general, for dependent random 
variables, E [XY] = E [X] E [Y] does not hold. 


2.20 Prove Proposition 2.8. 


HINT Argue that $X(\omega) = \int_{[0, \infty)} \mathbb{I}\{[0, X(\omega)]\}(x) \, dx$ and exchange the integrals.
Use the Fubini-Tonelli theorem to justify the exchange of integrals.


2.21 Prove Theorem 2.12. 


Stochastic Processes and Markov
Chains (†)


The measure-theoretic probability in the previous chapter covers almost all the 
definitions required. Occasionally, however, infinite sequences of random variables 
arise, and for these a little more machinery is needed. We expect most readers 
will skip this chapter on the first reading, perhaps referring to it when necessary. 

Before one can argue about the properties of infinite sequences of random 
variables, it must be demonstrated that such sequences exist under certain 
constraints on their joint distributions. For example, does there exist an infinite 
sequence of random variables such that any finite subset of the random variables 
are independent and distributed like a standard Gaussian? The first theorem 
provides conditions under which questions like this can be answered positively. 
This allows us to write, for example, ‘let $(X_n)_{n=1}^\infty$ be an infinite sequence of
independent standard Gaussian random variables’ and be comfortable knowing 
there exists a probability space on which these random variables can be defined. 
To state the theorem, we need the concept of Borel spaces. 

Two measurable spaces $(\mathcal{X}, \mathcal{F})$ and $(\mathcal{Y}, \mathcal{G})$ are said to be isomorphic if there
exists a bijective function $f : \mathcal{X} \to \mathcal{Y}$ such that $f$ is $\mathcal{F}/\mathcal{G}$-measurable and $f^{-1}$ is
$\mathcal{G}/\mathcal{F}$-measurable. A Borel space is a measurable space $(\mathcal{X}, \mathcal{F})$ that is isomorphic
to $(A, \mathcal{B}(A))$ with $A \in \mathcal{B}(\mathbb{R})$ a Borel measurable subset of the reals. This is not
a very strong assumption. For example, $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ is a Borel space, along with
all of its measurable subsets.


THEOREM 3.1. Let $\mu$ be a probability measure on a Borel space $S$ and $\lambda$ be the
Lebesgue measure on $([0, 1], \mathcal{B}([0, 1]))$. Then there exists a sequence of independent
random elements $X_1, X_2, \ldots$ on $([0, 1], \mathcal{B}([0, 1]), \lambda)$ such that the law $\lambda_{X_t} = \mu$ for
all $t$.


We give a sketch of the proof because, although it is not really relevant for 
the material in this book, it illustrates the general picture and dispels some of 
the mystique about what is really going on. Exercise 3.1 asks you to provide the
missing steps from the proof. 


Proof sketch of Theorem 3.1 For simplicity we consider only the case that
$S = ([0, 1], \mathcal{B}([0, 1]))$ and $\mu$ is the Lebesgue measure. For any $x \in [0, 1]$, let
$F_1(x), F_2(x), \ldots$ be the binary expansion of $x$, which is the unique binary-valued
infinite sequence such that

$$x = \sum_{t=1}^\infty F_t(x) 2^{-t}.$$

We can view $F_1, F_2, \ldots$ as (binary-valued) random variables over the probability
space $([0, 1], \mathcal{B}([0, 1]), \lambda)$. Viewed as such, a direct calculation shows that
$F_1, F_2, \ldots$ are independent. From this we can create an infinite sequence of
uniform random variables by reversing the process. To do this, we rearrange the
$(F_t)_{t=1}^\infty$ sequence into a grid. For example:

$$\begin{array}{llll}
F_1, & F_2, & F_4, & F_7, \ \cdots \\
F_3, & F_5, & F_8, & \cdots \\
F_6, & F_9, & \cdots & \\
F_{10}, & \cdots & &
\end{array}$$

Letting $X_{m,t}$ be the $t$th entry in the $m$th row of this grid, we define
$X_m = \sum_{t=1}^\infty 2^{-t} X_{m,t}$, and again one can easily check that with this choice the sequence
$X_1, X_2, \ldots$ is independent and $\lambda_{X_t} = \mu$ is uniform for each $t$.
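
The grid construction in the proof sketch can be imitated on a computer by truncating every binary expansion. The sketch below rests on assumptions of our own: the grid is filled along anti-diagonals exactly as in the example above, the bits $F_1, F_2, \ldots$ are drawn directly as fair coin flips (a floating-point number carries too few bits to split usefully), and the names `grid_index`, `split_bits` and `sample_uniforms` are hypothetical.

```python
import random


def grid_index(m, t):
    """Index i such that F_i is the t-th entry of the m-th row of the grid (1-based)."""
    d = m + t - 1                  # anti-diagonal that contains the entry
    return d * (d - 1) // 2 + m    # entries on earlier diagonals, then position on this one


def split_bits(bits, num_vars, num_bits):
    """X_m = sum_t 2^{-t} X_{m,t}, truncated to num_bits terms, for m = 1..num_vars."""
    return [
        sum(bits[grid_index(m, t) - 1] * 2.0 ** (-t) for t in range(1, num_bits + 1))
        for m in range(1, num_vars + 1)
    ]


def sample_uniforms(rng, num_vars=3, num_bits=20):
    """Draw F_1, F_2, ... as fair coin flips (the bits of one uniform x) and split them."""
    needed = grid_index(num_vars, num_bits)   # largest index used by the truncation
    bits = [rng.getrandbits(1) for _ in range(needed)]
    return split_bits(bits, num_vars, num_bits)


xs = sample_uniforms(random.Random(0))
```

Because distinct rows of the grid use disjoint sets of bits, the resulting variables are functions of disjoint groups of independent coin flips, which is the reason for their independence.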


Stochastic Processes 


Let $\mathcal{T}$ be an arbitrary set. A stochastic process on probability space $(\Omega, \mathcal{F}, \mathbb{P})$
is a collection of random variables $\{X_t : t \in \mathcal{T}\}$. In this book $\mathcal{T}$ will always
be countable, and so in the following we restrict ourselves to $\mathcal{T} = \mathbb{N}$. The first
theorem is not the most general, but suffices for our purposes and is more easily
stated than more generic alternatives.


THEOREM 3.2. For each $n \in \mathbb{N}^+$, let $(\Omega_n, \mathcal{F}_n)$ be a Borel space and $\mu_n$ be a
measure on $(\Omega_1 \times \cdots \times \Omega_n, \mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n)$ and assume that $\mu_n$ and $\mu_{n+1}$ are
related through

$$\mu_{n+1}(A \times \Omega_{n+1}) = \mu_n(A) \quad \text{for all } A \in \mathcal{F}_1 \otimes \cdots \otimes \mathcal{F}_n. \tag{3.1}$$

Then there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and random elements $X_1, X_2, \ldots$
with $X_t : \Omega \to \Omega_t$ such that $\mathbb{P}_{X_1, \ldots, X_n} = \mu_n$ for all $n$.


Sequences of measures $(\mu_n)_n$ satisfying Eq. (3.1) are called projective.


Theorem 3.1 follows immediately from Theorem 3.2. By assumption a random
variable takes values in $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, which is Borel. Then let $\mu_n = \otimes_{t=1}^n \mu$ be
the $n$-fold product measure of $\mu$ with itself. That this sequence of measures is
projective is clear, and the theorem does the rest.


Markov Chains


A Markov chain is an infinite sequence of random elements $(X_t)_{t=1}^\infty$ where the
conditional distribution of $X_{t+1}$ given $X_1, \ldots, X_t$ is the same as the conditional
distribution of $X_{t+1}$ given $X_t$. The sequence has the property that given the last
element, the history is irrelevant to ‘predict’ the future. Such random sequences 
appear throughout probability theory and have many applications besides. The 
theory is too rich to explain in detail, so we give the basics and point towards the 
literature for more details at the end. The focus here is mostly on the definition 
and existence of Markov chains. 

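Let $(\mathcal{X}, \mathcal{F})$ and $(\mathcal{Y}, \mathcal{G})$ be measurable spaces. A probability kernel or Markov
kernel from $(\mathcal{X}, \mathcal{F})$ to $(\mathcal{Y}, \mathcal{G})$ is a function $K : \mathcal{X} \times \mathcal{G} \to [0, 1]$ such that


(a) $K(x, \cdot)$ is a probability measure for all $x \in \mathcal{X}$; and
(b) $K(\cdot, A)$ is $\mathcal{F}$-measurable for all $A \in \mathcal{G}$.


The idea here is that $K$ describes a stochastic transition. Having arrived at $x$, a
process’s next state is sampled $Y \sim K(x, \cdot)$. Occasionally, we will use the notation
$K_x(A)$ or $K(A \mid x)$ rather than $K(x, A)$.

If $K_1$ is a $(\mathcal{X}, \mathcal{F}) \to (\mathcal{Y}, \mathcal{G})$ probability kernel and $K_2$ is a $(\mathcal{Y}, \mathcal{G}) \to (\mathcal{Z}, \mathcal{H})$
probability kernel, then the product kernel $K_1 \otimes K_2$ is the probability kernel
from $(\mathcal{X}, \mathcal{F}) \to (\mathcal{Y} \times \mathcal{Z}, \mathcal{G} \otimes \mathcal{H})$ defined by

$$(K_1 \otimes K_2)(x, A) = \int_{\mathcal{Y}} \int_{\mathcal{Z}} \mathbb{I}_A((y, z)) \, K_2(y, dz) \, K_1(x, dy).$$

When $P$ is a measure on $(\mathcal{X}, \mathcal{F})$ and $K$ is a kernel from $\mathcal{X}$ to $\mathcal{Y}$, then $P \otimes K$ is a
measure on $(\mathcal{X} \times \mathcal{Y}, \mathcal{F} \otimes \mathcal{G})$ defined by

$$(P \otimes K)(A) = \int_{\mathcal{X}} \int_{\mathcal{Y}} \mathbb{I}_A((x, y)) \, K(x, dy) \, dP(x).$$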
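
When all spaces are finite, a kernel is a table of conditional probabilities and both product operations become entrywise multiplications. The following sketch illustrates this under assumptions of our own: the three small spaces, the numbers, and the helper names `measure_times_kernel` and `kernel_times_kernel` are hypothetical.

```python
from fractions import Fraction as Fr

# Hypothetical finite spaces: X = {0, 1}, Y = {"u", "v"}, Z = {"r", "s"}.
P = {0: Fr(1, 4), 1: Fr(3, 4)}                    # probability measure on X
K1 = {0: {"u": Fr(1, 2), "v": Fr(1, 2)},          # kernel from X to Y: K1[x][y] = K1(x, {y})
      1: {"u": Fr(1, 3), "v": Fr(2, 3)}}
K2 = {"u": {"r": Fr(1), "s": Fr(0)},              # kernel from Y to Z
      "v": {"r": Fr(1, 5), "s": Fr(4, 5)}}


def measure_times_kernel(p, k):
    """(P (x) K)({(x, y)}) = P({x}) K(x, {y}): a measure on the product space."""
    return {(x, y): p[x] * k[x][y] for x in p for y in k[x]}


def kernel_times_kernel(k1, k2):
    """(K1 (x) K2)(x, {(y, z)}) = K1(x, {y}) K2(y, {z}): a kernel from X to Y x Z."""
    return {x: {(y, z): k1[x][y] * k2[y][z] for y in k1[x] for z in k2[y]}
            for x in k1}


joint_xy = measure_times_kernel(P, K1)            # law of (X, Y)
K12 = kernel_times_kernel(K1, K2)                 # kernel from X to Y x Z
joint_xyz = {(x, y, z): P[x] * q for x in P for (y, z), q in K12[x].items()}
```

Each row of `K12` sums to one, so the product of two probability kernels is again a probability kernel, and the first marginal of `joint_xy` recovers `P`.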


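These operations can be composed. When $P$ is a probability measure on $\mathcal{X}$ and
$K_1$ a kernel from $\mathcal{X}$ to $\mathcal{Y}$ and $K_2$ a kernel from $\mathcal{X} \times \mathcal{Y}$ to $\mathcal{Z}$, then $P \otimes K_1 \otimes K_2$
is a probability measure on $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$. The following provides a counterpart of
Theorem 3.2.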


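THEOREM 3.3 (Ionescu-Tulcea). Let $(\Omega_n, \mathcal{F}_n)_{n=1}^\infty$ be a sequence of measurable
spaces and $K_1$ be a probability measure on $(\Omega_1, \mathcal{F}_1)$. For $n \ge 2$, let $K_n$
be a probability kernel from $\prod_{t=1}^{n-1} \Omega_t$ to $\Omega_n$. Then there exists a probability
space $(\Omega, \mathcal{F}, \mathbb{P})$ and random elements $(X_t)_{t=1}^\infty$ with $X_t : \Omega \to \Omega_t$ such that
$\mathbb{P}_{X_1, \ldots, X_n} = \otimes_{t=1}^n K_t$ for all $n \in \mathbb{N}^+$.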


A homogeneous Markov chain is a sequence of random elements $(X_t)_{t=1}^\infty$
taking values in state space $S = (\mathcal{X}, \mathcal{F})$ and with

$$\mathbb{P}(X_{t+1} \in \cdot \mid X_1, \ldots, X_t) = \mathbb{P}(X_{t+1} \in \cdot \mid X_t) = \mu(X_t, \cdot) \quad \text{almost surely},$$

where $\mu$ is a probability kernel from $(\mathcal{X}, \mathcal{F})$ to $(\mathcal{X}, \mathcal{F})$ and we assume that
$\mathbb{P}(X_1 \in \cdot) = \mu_0(\cdot)$ for some measure $\mu_0$ on $(\mathcal{X}, \mathcal{F})$.


The word ‘homogeneous’ refers to the fact that the probability kernel does
not change with time. Accordingly, sometimes one writes ‘time homogeneous’
instead of homogeneous. The reader can no doubt see how to define a Markov
chain where $\mu$ depends on $t$, though doing so is purely cosmetic since the
state space can always be augmented to include a time component.


Note that if $\mu(x, \cdot) = \mu_0(\cdot)$ for all $x \in \mathcal{X}$, then Theorem 3.3 is yet another
way to prove the existence of an infinite sequence of independent and identically
distributed random variables. The basic questions in Markov chains revolve around
understanding the evolution of $X_t$ in terms of the probability kernel. For example,
assuming that $\Omega_t = \Omega_1$ for all $t \in \mathbb{N}^+$, does the law of $X_t$ converge to some fixed
distribution as $t \to \infty$, and if so, how fast is this convergence? For now we make
do with the definitions, but in the special case that $\mathcal{X}$ is finite, we will discuss
some of these topics much later in Chapters 37 and 38.
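
For a finite state space the kernel is a row-stochastic matrix and the law of $X_{t+1}$ is obtained from the law of $X_t$ by a vector-matrix product, so the convergence question can at least be explored numerically. The sketch below uses a hypothetical two-state chain; the kernel, the initial law and the names `step`, `law_at` and `distances` are assumptions made for this illustration.

```python
from fractions import Fraction as Fr

# Hypothetical two-state chain; row x of `kernel` is the distribution mu(x, .).
kernel = [[Fr(9, 10), Fr(1, 10)],
          [Fr(1, 2), Fr(1, 2)]]
mu0 = [Fr(0), Fr(1)]                 # law of X_1: start in the second state


def step(law, k):
    """Law of X_{t+1} from the law of X_t: (law K)(y) = sum_x law(x) K(x, {y})."""
    n = len(k)
    return [sum(law[x] * k[x][y] for x in range(n)) for y in range(n)]


def law_at(t, law, k):
    """Law of X_t when X_1 has distribution `law`."""
    for _ in range(t - 1):
        law = step(law, k)
    return law


stationary = [Fr(5, 6), Fr(1, 6)]    # solves pi K = pi for this particular kernel
assert step(stationary, kernel) == stationary

# Total variation distance between the law of X_t and the stationary law.
distances = [
    sum(abs(a - b) for a, b in zip(law_at(t, mu0, kernel), stationary)) / 2
    for t in (1, 2, 5, 10)
]
```

For this particular chain the distance shrinks by the factor $2/5$ at every step, which is the second eigenvalue of the matrix; general chains need not converge at all.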


Martingales and Stopping Times 


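Let $X_1, X_2, \ldots$ be a sequence of random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ and $\mathbb{F} = (\mathcal{F}_t)_{t=1}^n$
a filtration of $\mathcal{F}$, where we allow $n = \infty$. Recall that the sequence $(X_t)_{t=1}^n$ is
$\mathbb{F}$-adapted if $X_t$ is $\mathcal{F}_t$-measurable for all $1 \le t \le n$.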


DEFINITION 3.4. An $\mathbb{F}$-adapted sequence of random variables $(X_t)_{t \in \mathbb{N}^+}$ is an
$\mathbb{F}$-adapted martingale if


(a) $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = X_{t-1}$ almost surely for all $t \in \{2, 3, \ldots\}$; and
(b) $X_t$ is integrable.


If the equality is replaced with a less-than (greater-than), then we call $(X_t)_t$ a
supermartingale (respectively, a submartingale).


The time index $t$ need not run over $\mathbb{N}^+$. Very often $t$ starts at zero instead.


EXAMPLE 3.5. A gambler repeatedly throws a coin, winning a dollar for each
heads and losing a dollar for each tails. Their total winnings over time is a
martingale. To model this situation, let $Y_1, Y_2, \ldots$ be a sequence of independent
Rademacher random variables, which means that $\mathbb{P}(Y_t = 1) = \mathbb{P}(Y_t = -1) = 1/2$.
The winnings after $t$ rounds is $S_t = \sum_{s=1}^t Y_s$, which is a martingale adapted to
the filtration $(\mathcal{F}_t)_{t=1}^\infty$ given by $\mathcal{F}_t = \sigma(Y_1, \ldots, Y_t)$. The definition of super/sub-martingales
(the direction of inequality) can be remembered by noting
that the definition favors the casino, not the gambler.
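
The martingale property of the gambler's winnings can be verified exactly for a finite horizon by averaging over the next coin flip on every possible history. The sketch below does this and also shows how a coin biased against the gambler produces a supermartingale; the bias parameter `p` and the function names are assumptions introduced for this illustration.

```python
from fractions import Fraction as Fr
from itertools import product


def cond_exp_next(prefix, p=Fr(1, 2)):
    """E[S_t | Y_1, ..., Y_{t-1} = prefix] when P(Y_t = 1) = p, averaging over Y_t."""
    s_prev = sum(prefix)
    return p * (s_prev + 1) + (1 - p) * (s_prev - 1)


def drift_sign(n, p=Fr(1, 2)):
    """Compare E[S_t | F_{t-1}] with S_{t-1} on every atom of F_{t-1}, for t = 2..n.

    Returns 0 for a martingale, -1 for a strict supermartingale and
    +1 for a strict submartingale.
    """
    signs = set()
    for t in range(2, n + 1):
        for prefix in product((1, -1), repeat=t - 1):
            diff = cond_exp_next(prefix, p) - sum(prefix)
            signs.add((diff > 0) - (diff < 0))
    assert len(signs) == 1          # the sign is the same on every history
    return signs.pop()


fair = drift_sign(6)                     # unbiased coin
casino = drift_sign(6, Fr(2, 5))         # coin biased against the gambler
```

With the fair coin the conditional expectation equals the current winnings on every history, while with $p < 1/2$ it is strictly smaller, which is the supermartingale case.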


Can a gambler increase their expected winnings by stopping cleverly? Precisely,
the gambler at the end of round $t$ can decide to stop ($\delta_t = 1$) or continue ($\delta_t = 0$)
based on the information available to them. Denoting by $\tau = \min\{t : \delta_t = 1\}$
the time when the gambler stops, the question is whether by a clever choice of
$(\delta_t)_{t \in \mathbb{N}^+}$, $\mathbb{E}[S_\tau]$ can be made positive. Here, $(\delta_t)_{t \in \mathbb{N}^+}$, a sequence of binary, $\mathbb{F}$-adapted
random variables, is called a stopping rule, while $\tau$ is a stopping time with
respect to $\mathbb{F}$. Note that the stopping rule is not allowed to inject additional randomness
beyond what is already there in $\mathbb{F}$.


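DEFINITION 3.6. Let $\mathbb{F} = (\mathcal{F}_t)_{t \in \mathbb{N}}$ be a filtration. A random variable $\tau$ with values
in $\mathbb{N} \cup \{\infty\}$ is a stopping time with respect to $\mathbb{F}$ if $\mathbb{I}\{\tau \le t\}$ is $\mathcal{F}_t$-measurable
for all $t \in \mathbb{N}$. The $\sigma$-algebra at stopping time $\tau$ is

$$\mathcal{F}_\tau = \{A \in \mathcal{F}_\infty : A \cap \{\tau \le t\} \in \mathcal{F}_t \text{ for all } t\}.$$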


The filtration is usually indicated by writing ‘$\tau$ is an $\mathbb{F}$-stopping time’. When
the underlying filtration is obvious from context, it may be omitted. This is
also true for martingales.


Using the interpretation of $\sigma$-algebras encoding information, if $(\mathcal{F}_t)_t$ is thought
of as the knowledge available at time $t$, $\mathcal{F}_\tau$ is the information available at the
random time $\tau$. Exercise 3.7 asks you to explore properties of stopped $\sigma$-algebras;
amongst other things, it asks you to show that $\mathcal{F}_\tau$ is in fact a $\sigma$-algebra.


EXAMPLE 3.7. In the gambler example, the first time when the gambler’s
winnings hit 100 is a stopping time: $\tau = \min\{t : S_t = 100\}$. On the other
hand, $\tau = \min\{t : S_{t+1} = -1\}$ is not a stopping time because $\mathbb{I}\{\tau = t\}$ is not
$\mathcal{F}_t$-measurable.


Whether or not $\mathbb{E}[S_\tau]$ can be made positive by a clever choice of a stopping
time $\tau$ is answered in the negative by a fundamental theorem of Doob:


THEOREM 3.8 (Doob’s optional stopping). Let $\mathbb{F} = (\mathcal{F}_t)_{t \in \mathbb{N}}$ be a filtration and
$(X_t)_{t \in \mathbb{N}}$ be an $\mathbb{F}$-adapted martingale and $\tau$ an $\mathbb{F}$-stopping time such that at least
one of the following holds:


(a) There exists an $n \in \mathbb{N}$ such that $\mathbb{P}(\tau > n) = 0$.

(b) $\mathbb{E}[\tau] < \infty$, and there exists a constant $c \in \mathbb{R}$ such that for all $t \in \mathbb{N}$,
$\mathbb{E}[|X_{t+1} - X_t| \mid \mathcal{F}_t] \le c$ almost surely on the event that $\tau > t$.

(c) There exists a constant $c$ such that $|X_{t \wedge \tau}| \le c$ almost surely for all $t \in \mathbb{N}$.


Then $X_\tau$ is almost surely well defined, and $\mathbb{E}[X_\tau] = \mathbb{E}[X_0]$. Furthermore, when
$(X_t)$ is a super/sub-martingale rather than a martingale, then equality is replaced
with less/greater-than, respectively.


The theorem implies that if $S_\tau$ is almost surely well defined then either
$\mathbb{E}[\tau] = \infty$ or $\mathbb{E}[S_\tau] = 0$. Gamblers trying to outsmart the casino would need to
live a very long life! One application of Doob’s optional stopping theorem is a
useful and a priori surprising generalisation of Markov’s inequality to non-negative
supermartingales.
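
Returning to the gambler, condition (a) of the optional stopping theorem can be checked by brute force: for a stopping time capped at a finite horizon, the expected stopped winnings computed over all coin sequences should be exactly zero. The sketch below is an illustration under assumptions of our own. The winnings start at $S_0 = 0$, and the names `paths`, `expected_stopped_value`, `hit_level` and `peek_at_max` are hypothetical; the last rule looks into the future and is included only as a contrast.

```python
from fractions import Fraction as Fr
from itertools import product


def paths(n):
    """All 2^n equally likely paths (S_0, ..., S_n) of the winnings, with S_0 = 0."""
    for steps in product((1, -1), repeat=n):
        path = [0]
        for y in steps:
            path.append(path[-1] + y)
        yield path


def expected_stopped_value(n, rule):
    """E[S_tau] for tau = rule(path) <= n, by exact enumeration of all paths."""
    return sum(Fr(path[rule(path)], 2 ** n) for path in paths(n))


def hit_level(k):
    """Stopping time: first t with S_t = k, capped at the horizon n = len(path) - 1."""
    def rule(path):
        return next((t for t, s in enumerate(path) if s == k), len(path) - 1)
    return rule


def peek_at_max(path):
    """NOT a stopping time: it uses the whole path, including the future."""
    return max(range(len(path)), key=lambda t: path[t])


honest = expected_stopped_value(8, hit_level(3))
cheating = expected_stopped_value(8, peek_at_max)
```

The honest rule yields an expectation of zero no matter which level is chosen, while the rule that peeks at the future has strictly positive expectation, which is exactly what the measurability requirement on stopping times rules out.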


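THEOREM 3.9 (Maximal inequality). Let $(X_t)_{t=0}^\infty$ be a supermartingale with
$X_t \ge 0$ almost surely for all $t$. Then for any $\varepsilon > 0$,

$$\mathbb{P}\left(\sup_{t \in \mathbb{N}} X_t \ge \varepsilon\right) \le \frac{\mathbb{E}[X_0]}{\varepsilon}.$$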


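Proof Let $A_n$ be the event $\{\sup_{t \le n} X_t \ge \varepsilon\}$ and $\tau = (n + 1) \wedge \min\{t \le n : X_t \ge \varepsilon\}$,
where the minimum of an empty set is assumed to be infinite so that $\tau = n + 1$
if $X_t < \varepsilon$ for all $0 \le t \le n$. Clearly $\tau$ is a stopping time and $\mathbb{P}(\tau \le n + 1) = 1$.
Then by Theorem 3.8 and elementary calculation,

$$\mathbb{E}[X_0] \ge \mathbb{E}[X_\tau] \ge \mathbb{E}[X_\tau \mathbb{I}\{\tau \le n\}] \ge \mathbb{E}[\varepsilon \mathbb{I}\{\tau \le n\}] = \varepsilon \mathbb{P}(\tau \le n) = \varepsilon \mathbb{P}(A_n),$$

where the second inequality uses the definition of the stopping time and the
non-negativity of the supermartingale. Rearranging shows that $\mathbb{P}(A_n) \le \mathbb{E}[X_0]/\varepsilon$ for
all $n \in \mathbb{N}$. Since $A_1 \subseteq A_2 \subseteq \ldots$, it follows that
$\mathbb{P}(\sup_{t \in \mathbb{N}} X_t \ge \varepsilon) = \mathbb{P}(\cup_{n \in \mathbb{N}} A_n) \le \mathbb{E}[X_0]/\varepsilon$.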


Markov’s inequality (which we will cover in the next chapter) combined with
the definition of a supermartingale shows that

$$\mathbb{P}(X_n \ge \varepsilon) \le \frac{\mathbb{E}[X_0]}{\varepsilon}. \tag{3.2}$$

In fact, in the above we have effectively applied Markov’s inequality to the
random variable $X_\tau$ (the need for the proof arises when the conditions of
Doob’s optional stopping theorem are not met). The maximal inequality is
a strict improvement over Eq. (3.2) by replacing $X_n$ with $\sup_{t \in \mathbb{N}} X_t$ at no
cost whatsoever.
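
The gap between Eq. (3.2) and the maximal inequality can be seen on a concrete non-negative martingale. The sketch below assumes the hypothetical process $X_0 = 1$, $X_t = \prod_{s \le t} (1 + a Y_s)$ with independent Rademacher $Y_s$ and $a = 1/2$, and computes both tail probabilities exactly over a finite horizon; the function name and parameters are ours.

```python
from fractions import Fraction as Fr
from itertools import product


def tail_probabilities(n, eps, a=Fr(1, 2)):
    """Return (P(max_{t<=n} X_t >= eps), P(X_n >= eps)) for X_t = prod_{s<=t} (1 + a*Y_s)."""
    hit_max = hit_last = 0
    for steps in product((1, -1), repeat=n):
        x, running_max = Fr(1), Fr(1)        # X_0 = 1
        for y in steps:
            x *= 1 + a * y                   # factors have mean one, so X is a martingale
            running_max = max(running_max, x)
        hit_max += running_max >= eps
        hit_last += x >= eps
    return Fr(hit_max, 2 ** n), Fr(hit_last, 2 ** n)


p_max, p_last = tail_probabilities(10, 2)
# Both probabilities respect the bound E[X_0] / eps = 1/2.
assert p_last <= p_max <= Fr(1, 2)
```

In this example the probability that the running maximum reaches the level is strictly larger than the probability that the final value does, yet both are controlled by the same bound.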


A similar theorem holds for submartingales. You will provide a proof in
Exercise 3.8.


THEOREM 3.10. Let $(X_t)_{t=0}^n$ be a submartingale with $X_t \ge 0$ almost surely for
all $t$. Then for any $\varepsilon > 0$,

$$\mathbb{P}\left(\max_{t \in \{0, 1, \ldots, n\}} X_t \ge \varepsilon\right) \le \frac{\mathbb{E}[X_n]}{\varepsilon}.$$


Notes


Some authors include in the definition of a stopping time $\tau$ that $\mathbb{P}(\tau < \infty) = 1$
and call random times without this property Markov times. We do not adopt
this convention and allow stopping times to be infinite with non-zero probability.
Stopping times are also called optional times.

There are several notations for probability kernels depending on the application.
The following are commonly seen and equivalent: $K(x, A) = K(A \mid x) = K_x(A)$.
For example, in statistics a parametric family is often given by $\{P_\theta : \theta \in \Theta\}$,
where $\Theta$ is the parameter space and $P_\theta$ is a measure on some measurable space
$(\Omega, \mathcal{F})$. This notation is often more convenient than writing $P(\theta, \cdot)$. In Bayesian
statistics the posterior is a probability kernel from the observation space to
the parameter space, and this is often written as $P(\cdot \mid x)$.

There is some disagreement about whether or not a Markov chain on an 
uncountable state space should instead be called a Markov process. In this 
book we use Markov chain for arbitrary state spaces and discrete time. When 
time is continuous (which it never is in this book), there is general agreement 
that ‘process’ is more appropriate. For more history on this debate, see [Meyn 
and Tweedie, 2012, preface]. 

A topological space $\mathcal{X}$ is Polish if it is separable and there exists a metric
$d$ that is compatible with the topology that makes $(\mathcal{X}, d)$ a complete metric
space. All Polish spaces are Borel spaces. We follow Kallenberg [2002], but
many authors use standard Borel space rather than Borel space, and define
it as the $\sigma$-algebra generated by the open sets of a Polish space.

In Theorem 3.2 it was assumed that each $\mu_n$ was defined on a Borel space.
No such assumption was required for Theorem 3.3, however. One can derive 
Theorem 3.2 from Theorem 3.3 by using the existence of regular conditional 
probability measures when conditioning on random elements taking values 
in a Borel space (see the next note). Topological assumptions often creep 
into foundational questions relating to the existence of probability measures 
satisfying certain conditions, and pathological examples show these assumptions 
cannot be removed completely. Luckily, in this book we have no reason to 
consider random elements that do not take values in a Borel space. 

The fact that conditional expectation is only unique almost surely can be
problematic when you want a conditional distribution. Given random elements
$X$ and $Y$ on the same probability space, it seems reasonable to hope that
$\mathbb{P}(X \in \cdot \mid Y)$ is a probability kernel from the space of $Y$ to that of $X$. A
version of the conditional distribution that satisfies this is called a regular
version. In general, there is no guarantee that such a regular version exists. The
basic properties of conditional expectation only guarantee that for any fixed
measurable $A$, $\mathbb{P}(X \in A \mid Y)$ is unique up to a set of measure zero. The set of
measure zero can depend on $A$, which causes problems when there are ‘too
many’ measurable sets in the space of $X$. Assuming $X$ lives in a Borel space,
the following theorem guarantees the existence of a conditional distribution.


THEOREM 3.11 (Regular conditional distributions). Let $X$ and $Y$ be random
elements on the same probability space $(\Omega, \mathcal{F}, \mathbb{P})$ taking values in measurable
spaces $\mathcal{X}$ and $\mathcal{Y}$, and assume that $\mathcal{X}$ is Borel. Then there exists a probability
kernel $K$ from $\mathcal{Y}$ to $\mathcal{X}$ such that $K(\cdot \mid Y) = \mathbb{P}(X \in \cdot \mid Y)$ $\mathbb{P}$-almost surely.
Furthermore, $K$ is unique in the sense that for any kernels $K_1$ and $K_2$ satisfying
this condition, it holds that $K_1(\cdot \mid y) = K_2(\cdot \mid y)$ for all $y$ in some set of
$\mathbb{P}_Y$-measure one.


The theorem implies the useful relation that $\mathbb{P}_{(X,Y)} = \mathbb{P}_Y \otimes K$ (cf. Exercise 3.9),
where we recall that for a random variable $Z$, $\mathbb{P}_Z$ denotes its pushforward under
$\mathbb{P}$. To make the origin of $K$ clear, we often write $\mathbb{P}_{X \mid Y}$ instead of $K$. With this,
the above equality becomes $\mathbb{P}_{(X,Y)} = \mathbb{P}_Y \otimes \mathbb{P}_{X \mid Y}$, which can be viewed as the
converse of the Ionescu-Tulcea theorem (Theorem 3.3). Sometimes this is called
the chain rule of probability measures.
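
In the finite case the kernel is obtained by dividing the joint probabilities by the marginal of $Y$, and the non-uniqueness on null sets is easy to see. The sketch below uses a hypothetical joint law in which one value of $Y$ has probability zero; the numbers and the helper `conditional_kernel` are assumptions made for this illustration.

```python
from fractions import Fraction as Fr

# Hypothetical joint law of (X, Y) on {0, 1} x {"a", "b", "c"}; note P(Y = "c") = 0.
joint = {(0, "a"): Fr(1, 8), (1, "a"): Fr(3, 8),
         (0, "b"): Fr(1, 2), (1, "b"): Fr(0),
         (0, "c"): Fr(0), (1, "c"): Fr(0)}
xs, ys = (0, 1), ("a", "b", "c")

P_Y = {y: sum(joint[x, y] for x in xs) for y in ys}      # law of Y


def conditional_kernel(default):
    """A version of P_{X|Y}; at P_Y-null points y any distribution `default` may be used."""
    return {
        y: ({x: joint[x, y] / P_Y[y] for x in xs} if P_Y[y] > 0 else dict(default))
        for y in ys
    }


K_a = conditional_kernel({0: Fr(1), 1: Fr(0)})
K_b = conditional_kernel({0: Fr(1, 2), 1: Fr(1, 2)})

# Both versions reproduce the joint law: P({(x, y)}) = P_Y({y}) K(y, {x}).
for K in (K_a, K_b):
    assert all(P_Y[y] * K[y][x] == joint[x, y] for x in xs for y in ys)

# The two versions differ only on the P_Y-null point "c".
differing = [y for y in ys if K_a[y] != K_b[y]]
```

On a finite space a regular version always exists; the difficulties described above only arise when there are uncountably many events whose exceptional null sets must be controlled simultaneously.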

You can also condition on a $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$, in which case $K$ is a probability
kernel from $(\Omega, \mathcal{G})$ to $\mathcal{X}$. The condition that $\mathcal{X}$ be Borel is sufficient, but not
necessary. Some conditions are required, however. An example where no regular
version exists can be found in [Halmos, 1976, p210]. Regular versions play
a role in the following useful theorem for decomposing random variables on
product spaces.


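THEOREM 3.12 (Disintegration). Let $X$ and $Y$ be random elements on the
same probability space taking values in measurable spaces $\mathcal{X}$ and $\mathcal{Y}$. Let $f$ be
a random variable on $\mathcal{X} \times \mathcal{Y}$ so that $\mathbb{E}[|f(X, Y)|] < \infty$. Suppose that $K$ is a
regular version of $\mathbb{P}(X \in \cdot \mid \mathcal{G})$ and $Y$ is $\mathcal{G}$-measurable. Then,

$$\mathbb{E}[f(X, Y) \mid \mathcal{G}] = \int_{\mathcal{X}} f(x, Y) \, K(dx \mid \cdot) \quad \text{almost surely}.$$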


In many applications $\mathcal{G} = \sigma(Y)$, in which case the theorem says that
$\mathbb{E}[f(X, Y) \mid Y] = \int_{\mathcal{X}} f(x, Y) \, K(dx \mid Y)$ almost surely. Proofs of both theorems
appear in chapter 6 of Kallenberg [2002]. More advanced theorems, e.g., when
$X = (X_1, X_2, \ldots)$ in Theorem 3.11 can be a real-valued stochastic process (for
which the corresponding space $\mathcal{X} = \mathbb{R}^{\mathbb{N}}$ has too large of a cardinality to be
a Borel space), are also available. See, for example, Section 7.2 of Chow and
Teicher [1997].
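
For finite spaces and $\mathcal{G} = \sigma(Y)$ the disintegration formula says that the conditional expectation of $f(X, Y)$ given $Y = y$ is a sum of $f(x, y)$ against the kernel $K(\cdot \mid y)$, with $y$ held fixed inside $f$. The sketch below checks the defining property of conditional expectation for such a candidate; the law of $Y$, the kernel, the function `f` and the helper names are all hypothetical.

```python
from fractions import Fraction as Fr

# Hypothetical finite example: the law of Y and a kernel K(. | y) in the role of P_{X|Y}.
P_Y = {"a": Fr(1, 2), "b": Fr(1, 2)}
K = {"a": {0: Fr(1, 4), 1: Fr(3, 4)},
     "b": {0: Fr(1), 1: Fr(0)}}


def f(x, y):
    """An arbitrary (illustrative) function of the pair (x, y)."""
    return (x + 1) ** 2 if y == "a" else 10 * x - 3


def cond_exp(y):
    """Candidate for E[f(X, Y) | Y = y]: the sum over x of f(x, y) K(x | y)."""
    return sum(f(x, y) * p for x, p in K[y].items())


def lhs(B):
    """E[ E[f(X, Y) | Y] 1{Y in B} ], using the candidate above."""
    return sum(P_Y[y] * cond_exp(y) for y in B)


def rhs(B):
    """E[ f(X, Y) 1{Y in B} ], computed from the joint law P_Y (x) K."""
    return sum(P_Y[y] * K[y][x] * f(x, y) for y in B for x in K[y])


events = ([], ["a"], ["b"], ["a", "b"])      # every event in sigma(Y)
assert all(lhs(B) == rhs(B) for B in events)
```

Taking `B` to be the whole space gives the tower rule as a special case.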


Bibliographic Remarks 


There are many places to find the construction of a stochastic process. As
before, we recommend Kallenberg [2002] for readers who want to refresh their
memory and Billingsley [2008] for a more detailed account. One of the authors
also likes Chow and Teicher [1997] very much as it is relatively short, but has a
lot of content in it. For Markov chains the recent book by Levin and Peres [2017]
provides a wonderful introduction. After reading that, you might like the tome
by Meyn and Tweedie [2012]. Theorem 3.1 can be found as theorem 3.19 in the
book by Kallenberg [2002], where the reader can also find its proof. Theorem 3.2 
is credited to Percy John Daniell by Kallenberg [2002] (see Aldrich 2007). More 
general versions of this theorem exist. Readers looking for these should look 
up Kolmogorov’s extension theorem [Kallenberg, 2002, theorem 6.16]. The 
theorem of Ionescu-Tulcea (Theorem 3.3) is attributed to him [Ionescu Tulcea, 
1949-50] with a modern proof in the book by Kallenberg [2002, theorem 6.17]. 
There are lots of minor variants of the optional stopping theorem, most of which 
can be found in any probability book featuring martingales. The most historically 
notable source is by the man himself [Doob, 1953]. A more modern book that 
also gives the maximal inequalities is the book on optimal stopping by Peskir 
and Shiryaev [2006]. 


Exercises 


3.1 Fill in the details of Theorem 3.1:


(a) Prove that $F_t \in \{0,1\}$ is a Bernoulli random variable for all $t \ge 1$.

(b) In what follows, equip $S$ with $\mathbb P = \lambda$, the uniform probability measure. Show
that for any $t \ge 1$, $F_t$ is uniformly distributed: $\mathbb P(F_t = 0) = \mathbb P(F_t = 1) = 1/2$.

(c) Show that $(F_t)_{t=1}^\infty$ are independent.

(d) Show that $(X_{m,t})_{t=1}^\infty$ is an independent sequence of Bernoulli random
variables that are uniformly distributed.

(e) Show that $X_t = \sum_{m=1}^\infty X_{m,t} 2^{-m}$ is uniformly distributed on $[0, 1]$.

(f) Show that $(X_t)_{t=1}^\infty$ are independent.


3.2 (MARTINGALES AND OPTIONAL STOPPING) Let $(X_t)_{t=1}^\infty$ be an infinite
sequence of independent Rademacher random variables and $S_t = \sum_{s=1}^t X_s 2^{s-1}$.


(a) Show that $(S_t)_{t=0}^\infty$ is a martingale.

(b) Let $\tau = \min\{t : S_t = 1\}$ and show that $\mathbb P(\tau < \infty) = 1$.

(c) What is $\mathbb E[S_\tau]$?

(d) Explain why this does not contradict Doob’s optional stopping theorem.


3.3 (MARTINGALES AND OPTIONAL STOPPING (II)) Give an example of a
martingale $(S_n)_{n=0}^\infty$ and stopping time $\tau$ such that


    $\lim_{n \to \infty} \mathbb E[S_{\tau \wedge n}] \ne \mathbb E[S_\tau]$.


3.4 (MAXIMAL INEQUALITY FAILS WITHOUT NON-NEGATIVITY) Show that 
Theorem 3.9 does not hold in general for supermartingales if the assumption that 
it be non-negative is dropped. 


3.5 Let $(\Omega, \mathcal F)$ and $(\mathcal X, \mathcal G)$ be measurable spaces, $X : \mathcal X \to \mathbb R$ be a random variable
and $K : \Omega \times \mathcal G \to [0,1]$ a probability kernel from $(\Omega, \mathcal F)$ to $(\mathcal X, \mathcal G)$. Define the function
$U : \Omega \to \mathbb R$ by $U(\omega) = \int_{\mathcal X} X(x) \, K(\omega, dx)$ and assume that $U(\omega)$ exists for all $\omega$.
Prove that $U$ is $\mathcal F$-measurable.


3.6 (LIMITS OF INCREASING STOPPING TIMES ARE STOPPING TIMES) Let $(\tau_n)_{n=1}^\infty$
be an almost surely increasing sequence of $\mathbb F$-stopping times on probability space
$(\Omega, \mathcal F, \mathbb P)$ with filtration $\mathbb F = (\mathcal F_n)_{n=1}^\infty$, which means that $\tau_n(\omega) \le \tau_{n+1}(\omega)$ for all
$n \ge 1$ almost surely. Prove that $\tau(\omega) = \lim_{n \to \infty} \tau_n(\omega)$ is an $\mathbb F$-stopping time.


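3.7 (PROPERTIES OF STOPPING TIMES) Let $\mathbb F = (\mathcal F_t)_{t \in \mathbb N}$ be a filtration, and
$\tau, \tau_1, \tau_2$ be stopping times with respect to $\mathbb F$. Show the following:


(a) $\mathcal F_\tau$ is a $\sigma$-algebra.

(b) If $\tau = k$ for some $k \ge 1$, then $\mathcal F_\tau = \mathcal F_k$.

(c) If $\tau_1 \le \tau_2$, then $\mathcal F_{\tau_1} \subseteq \mathcal F_{\tau_2}$.

(d) $\tau$ is $\mathcal F_\tau$-measurable.

(e) If $(X_t)$ is $\mathbb F$-adapted, then $X_\tau$ is $\mathcal F_\tau$-measurable.

(f) $\mathcal F_\tau$ is the smallest $\sigma$-algebra such that all $\mathbb F$-adapted sequences $(X_t)$ satisfy
$X_\tau$ is $\mathcal F_\tau$-measurable.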


3.8 Prove Theorem 3.10. 


3.9 (DECOMPOSING JOINT DISTRIBUTIONS) Let $X$ and $Y$ be random elements
on the same probability space $(\Omega, \mathcal F, \mathbb P)$ taking values in measurable spaces $\mathcal X$
and $\mathcal Y$ respectively and assume that $\mathcal X$ is Borel. Show that $\mathbb P_{(X,Y)} = \mathbb P_Y \otimes \mathbb P_{X|Y}$,
where $\mathbb P_{X|Y}$ denotes a regular conditional distribution of $X$ given $Y$ (the existence
of which is guaranteed by Theorem 3.11).


4.1 


Stochastic Bandits 


The goal of this chapter is to formally introduce stochastic bandits. The model 
introduced here provides the foundation for the remaining chapters that treat 
stochastic bandits. While the topic seems a bit mundane, it is important to be 
clear about the assumptions and definitions. The chapter also introduces and 
motivates the learning objectives, and especially the regret. Besides the definitions, 
the main result in this chapter is the regret decomposition, which is presented in 
Section 4.5. 


Core Assumptions 


A stochastic bandit is a collection of distributions $\nu = (P_a : a \in \mathcal A)$, where $\mathcal A$ is
the set of available actions. The learner and the environment interact sequentially
over $n$ rounds. In each round $t \in \{1, \ldots, n\}$, the learner chooses an action $A_t \in \mathcal A$,
which is fed to the environment. The environment then samples a reward $X_t \in \mathbb R$
from distribution $P_{A_t}$ and reveals $X_t$ to the learner. The interaction between
the learner (or policy) and environment induces a probability measure on the
sequence of outcomes $A_1, X_1, A_2, X_2, \ldots, A_n, X_n$. Usually the horizon $n$ is finite,
but sometimes we allow the interaction to continue indefinitely ($n = \infty$). The
sequence of outcomes should satisfy the following assumptions:


(a) The conditional distribution of reward $X_t$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}, A_t$
is $P_{A_t}$, which captures the intuition that the environment samples $X_t$ from
$P_{A_t}$ in round $t$.
(b) The conditional law of action $A_t$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}$ is
$\pi_t(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$, where $\pi_1, \pi_2, \ldots$ is a sequence of probability
kernels that characterise the learner. The most important element of this
assumption is the intuitive fact that the learner cannot use the future
observations in current decisions.
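
The interaction protocol can be sketched in a few lines of code. This is an illustration under our own assumptions rather than a construction from the text: the names `interact`, `bernoulli` and `uniform_policy` are hypothetical, and the reward distributions are represented by sampling functions.

```python
import random

def interact(arms, policy, n, rng):
    """Run n rounds of the learner-environment protocol.

    arms:   dict mapping each action a to a sampler rng -> reward (the role of P_a)
    policy: function (history, rng) -> action, where history is the list of
            (action, reward) pairs observed so far, so future rewards are never visible
    """
    history = []
    for t in range(1, n + 1):
        a_t = policy(history, rng)   # A_t depends only on A_1, X_1, ..., A_{t-1}, X_{t-1}
        x_t = arms[a_t](rng)         # X_t is sampled from P_{A_t}
        history.append((a_t, x_t))
    return history

def bernoulli(mean):
    """Sampler for a Bernoulli reward distribution with the given mean."""
    return lambda rng: 1.0 if rng.random() < mean else 0.0

def uniform_policy(history, rng):
    """A hypothetical policy that ignores the history and picks an arm uniformly."""
    return rng.choice([1, 2])

arms = {1: bernoulli(0.4), 2: bernoulli(0.6)}
outcome = interact(arms, uniform_policy, n=10, rng=random.Random(0))
```

The signature of `policy` enforces the second assumption by design: the function only ever receives the past, so it has no way to use future observations.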


A mathematician might ask whether there even exists a probability space carrying 
these random elements such that (a) and (b) hold. Specific constructions showing 
this in the affirmative are given in Section 4.6. These constructions are also 
valuable because they teach us important lessons about equivalent models. For 
now, however, we move on.


The Learning Objective


The learner’s goal is to maximise the total reward $S_n = \sum_{t=1}^n X_t$, which is a
random quantity that depends on the actions of the learner and the rewards 
sampled by the environment. This is not an optimisation problem for three 
reasons: 


1 What is the value of n for which we are maximising? Occasionally prior 
knowledge of the horizon is reasonable, but very often the learner does not 
know ahead of time how many rounds are to be played. 

2 The cumulative reward is a random quantity. Even if the reward distributions
were known, we would still require a measure of utility on distributions of $S_n$.

3 The learner does not know the distributions that govern the rewards for each 
arm. 


Of these points, the last is fundamental to the bandit problem and is discussed 
in the next section. The lack of knowledge of the horizon is usually not a serious 
issue. Generally speaking it is possible to first design a policy assuming the 
horizon is known and then adapt it to account for the unknown horizon while 
proving that the loss in performance is minimal. This is almost always quite easy, 
and there exist generic approaches for making the conversion. 

Assigning a utility to distributions of $S_n$ is more challenging. Suppose
that $S_n$ is the revenue of your company. Fig. 4.1 shows the distribution of
$S_n$ for two different learners; call them A and B. Suppose you can choose
between learners A and B. Which one would you choose? One choice is to
go with the learner whose reward distribution has the larger expected value.
This will be our default choice for stochastic bandits, but it bears remembering
that there are other considerations, including the variance or tail behaviour of
the cumulative reward, which we will discuss occasionally. In particular, in the
situation shown in Fig. 4.1, learner B achieves a higher expected reward than A.
However, B has a reasonable probability of earning less than the least amount
that A can earn, so a risk-sensitive user may prefer learner A.

[Figure 4.1 Alternative revenue distributions. Axes: Reward, Density.]


Knowledge and Environment Classes 


Even if the horizon is known in advance and we commit to maximising the expected
value of $S_n$, there is still the problem that the bandit instance $\nu = (P_a : a \in \mathcal A)$ is
unknown. A policy that maximises the expectation of $S_n$ for one bandit instance
may behave quite badly on another. The learner usually has partial information
about $\nu$, which we represent by defining a set of bandits $\mathcal E$ for which $\nu \in \mathcal E$ is
guaranteed. The set $\mathcal E$ is called the environment class. We distinguish between
structured and unstructured bandits.


Unstructured Bandits 
An environment class $\mathcal E$ is unstructured if $\mathcal A$ is finite and there exist sets of
distributions $\mathcal M_a$ for each $a \in \mathcal A$ such that


    $\mathcal E = \{\nu = (P_a : a \in \mathcal A) : P_a \in \mathcal M_a \text{ for all } a \in \mathcal A\}$,


or, in short, $\mathcal E = \times_{a \in \mathcal A} \mathcal M_a$. The product structure means that by playing action
$a$ the learner cannot deduce anything about the distributions of actions $b \ne a$.

Some typical choices of unstructured bandits are listed in Table 4.1. Of course, 
these are not the only choices, and the reader can no doubt find ways to construct 
more, e.g. by allowing some arms to be Bernoulli and some Gaussian, or have 
rewards being exponentially distributed, or Gumbel distributed, or belonging to 
your favourite (non-)parametric family. 

The Bernoulli, Gaussian and uniform distributions are often used as examples 
for illustrating some specific property of learning in stochastic bandit problems. 
The Bernoulli distribution is actually a natural choice. Think of applications like 
maximising click-through rates in a web-based environment. A bandit problem 
is often called a ‘distribution bandit’, where ‘distribution’ is replaced by the 
underlying distribution from which the pay-offs are sampled. Some examples 
are: Gaussian bandit, Bernoulli bandit or subgaussian bandit. Similarly we say 
‘bandits with X’, where ‘X’ is a property of the underlying distribution from 
which the pay-offs are sampled. For example, we can talk about bandits with 
finite variance, meaning the bandit environment where the a priori knowledge of 
the learner is that all pay-off distributions are such that their underlying variance 
is finite. 

Some environment classes, like Bernoulli bandits, are parametric, while others, 
like subgaussian bandits, are non-parametric. The distinction is the number of 
degrees of freedom needed to describe an element of the environment class. When 
the number of degrees of freedom is finite, it is parametric, and otherwise it is 
non-parametric. Of course, if a learner is designed for a specific environment class
$\mathcal E$, then we might expect that it has good performance on all bandits $\nu \in \mathcal E$. Some
environment classes are subsets of other classes. For example, Bernoulli bandits 
are a special case of bandits with a finite variance, or bandits with bounded 
support. Something to keep in mind is that we expect that it will be harder to 
achieve a good performance in a larger class. In a way, the theory of finite-armed 
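stochastic bandits tries to quantify this expectation in a rigorous fashion.

Name                      Symbol                                   Definition

Bernoulli                 $\mathcal E^k_{\mathcal B}$              $\{(\mathcal B(\mu_i))_i : \mu \in [0,1]^k\}$
Uniform                   $\mathcal E^k_{\mathcal U}$              $\{(\mathcal U(a_i, b_i))_i : a, b \in \mathbb R^k \text{ with } a_i \le b_i \text{ for all } i\}$
Gaussian (known var.)     $\mathcal E^k_{\mathcal N}(\sigma^2)$    $\{(\mathcal N(\mu_i, \sigma^2))_i : \mu \in \mathbb R^k\}$
Gaussian (unknown var.)   $\mathcal E^k_{\mathcal N}$              $\{(\mathcal N(\mu_i, \sigma_i^2))_i : \mu \in \mathbb R^k \text{ and } \sigma^2 \in [0,\infty)^k\}$
Finite variance           $\mathcal E^k_{\mathbb V}(\sigma^2)$     $\{(P_i)_i : \mathbb V_{X \sim P_i}[X] \le \sigma^2 \text{ for all } i\}$
Finite kurtosis           $\mathcal E^k_{\mathrm{Kurt}}(\kappa)$   $\{(P_i)_i : \mathrm{Kurt}_{X \sim P_i}[X] \le \kappa \text{ for all } i\}$
Bounded support           $\mathcal E^k_{[a,b]}$                   $\{(P_i)_i : \mathrm{Supp}(P_i) \subseteq [a,b]\}$
Subgaussian               $\mathcal E^k_{\mathrm{SG}}(\sigma^2)$   $\{(P_i)_i : P_i \text{ is } \sigma\text{-subgaussian for all } i\}$


Table 4.1 Typical environment classes for stochastic bandits. $\mathrm{Supp}(P)$ is the (topological)
support of distribution $P$. The kurtosis of a random variable $X$ is a measure of its tail
behaviour and is defined by $\mathbb E[(X - \mathbb E[X])^4] / \mathbb V[X]^2$. Subgaussian distributions have similar
properties to the Gaussian and will be defined in Chapter 5.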


Structured Bandits 

Environment classes that are not unstructured are called structured. Relaxing the 
requirement that the environment class is a product set makes structured bandit 
problems much richer than the unstructured set-up. The following examples 
illustrate the flexibility. 


EXAMPLE 4.1. Let $\mathcal A = \{1, 2\}$ and $\mathcal E = \{(\mathcal B(\theta), \mathcal B(1 - \theta)) : \theta \in [0, 1]\}$. In this
environment class, the learner does not know the mean of either arm, but can
learn the mean of both arms by playing just one. The knowledge of this structure
dramatically changes the difficulty of learning in this problem.


EXAMPLE 4.2 (Stochastic linear bandit). Let $\mathcal A \subset \mathbb R^d$ and $\theta \in \mathbb R^d$ and

    $\nu_\theta = (\mathcal N(\langle a, \theta \rangle, 1) : a \in \mathcal A)$ and $\mathcal E = \{\nu_\theta : \theta \in \mathbb R^d\}$.


In this environment class, the reward of an action is Gaussian, and its mean is given
by the inner product between the action and some unknown parameter. Notice
that even if $\mathcal A$ is extremely large, the learner can deduce the true environment
by playing just $d$ actions that span $\mathbb R^d$.
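
The last observation can be made concrete with a small computation. The sketch below is illustrative only and makes a simplifying assumption: the learner is taken to know the mean rewards of the $d$ played actions exactly, whereas in practice these means would have to be estimated from noisy rewards. The names `solve` and `inner` and the numbers are hypothetical.

```python
from fractions import Fraction

def solve(rows, rhs):
    """Solve the square linear system rows * theta = rhs by Gauss-Jordan elimination."""
    d = len(rows)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(rows, rhs)]
    for col in range(d):
        # A non-zero pivot exists whenever the played actions span R^d.
        pivot = next(r for r in range(col, d) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[r][d] for r in range(d)]

def inner(a, theta):
    """Inner product <a, theta>, the mean reward of action a."""
    return sum(Fraction(x) * t for x, t in zip(a, theta))

theta_true = [Fraction(2), Fraction(-1)]         # hidden from the learner
played = [(1, 1), (1, -1)]                       # d = 2 actions that span R^2
means = [inner(a, theta_true) for a in played]   # assumed known exactly in this sketch
theta_hat = solve(played, means)                 # recovers the unknown parameter
predicted = inner((3, 2), theta_hat)             # mean of an action that was never played
```

Once the parameter is recovered, the mean of every action in $\mathcal A$ is determined, no matter how many actions there are.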


EXAMPLE 4.3. Consider an undirected graph $G$ with vertices $V = \{1, \ldots, |V|\}$
and edges $E = \{1, \ldots, |E|\}$. In each round the learner chooses a path from
vertex $1$ to vertex $|V|$. Then each edge $e \in E$ is removed from the graph with
probability $1 - \theta_e$ for unknown $\theta \in [0,1]^{|E|}$. The learner succeeds in reaching
their destination if all the edges in their chosen path are present. This problem
can be formalised by letting $\mathcal A$ be the set of paths and


    $\nu_\theta = \left(\mathcal B\left(\prod_{e \in a} \theta_e\right) : a \in \mathcal A\right)$ and $\mathcal E = \{\nu_\theta : \theta \in [0,1]^{|E|}\}$.


An important feature of structured bandits is that the learner can often 
obtain information about some actions while never playing them. 
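
For the path problem of Example 4.3, the mean reward of a path is the product of the survival probabilities of its edges, which presumes that edges are removed independently. A minimal sketch, with a hypothetical three-edge graph and made-up values of $\theta$:

```python
from math import prod

def path_success_probability(path, theta):
    """Mean reward of a path: every edge on it must survive (independence assumed)."""
    return prod(theta[e] for e in path)

# Hypothetical instance: edge survival probabilities and two candidate paths.
theta = {1: 0.9, 2: 0.5, 3: 0.8}
paths = {"upper": (1, 2), "lower": (3, 2)}
path_means = {name: path_success_probability(p, theta) for name, p in paths.items()}
```

Both paths share edge 2, so playing the upper path also carries information about the lower one, which is the kind of structure the remark above refers to.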


The Regret 


In Chapter 1 we informally defined the regret as being the deficit suffered by the
learner relative to the optimal policy. Let $\nu = (P_a : a \in \mathcal A)$ be a stochastic bandit
and define


    $\mu_a(\nu) = \int_{-\infty}^{\infty} x \, dP_a(x)$.


Then let $\mu^*(\nu) = \max_{a \in \mathcal A} \mu_a(\nu)$ be the largest mean of all the arms.


We assume throughout that $\mu_a(\nu)$ exists and is finite for all actions and
that $\operatorname{argmax}_{a \in \mathcal A} \mu_a(\nu)$ is non-empty. The latter assumption could be relaxed
by carefully adapting all arguments using nearly optimal actions, but in
practice this is never required.


The regret of policy $\pi$ on bandit instance $\nu$ is


    $R_n(\pi, \nu) = n \mu^*(\nu) - \mathbb E\left[\sum_{t=1}^n X_t\right]$,    (4.1)


where the expectation is taken with respect to the probability measure on
outcomes induced by the interaction of $\pi$ and $\nu$. Minimising the regret is equivalent
to maximising the expectation of $S_n$, but the normalisation inherent in the
definition of the regret is useful when stating results, which would otherwise need
to be stated relative to the optimal action.


If the context is clear, we will often drop the dependence on $\nu$ and $\pi$ in various
quantities, for example, by writing $R_n = n\mu^* - \mathbb E[\sum_t X_t]$. Similarly, the
limits in sums and maxima are abbreviated when we think you can work
out ranges of symbols in a unique way, e.g. $\mu^* = \max_i \mu_i$.


The regret is always non-negative, and for every bandit $\nu$, there exists a policy
$\pi$ for which the regret vanishes.

LEMMA 4.4. Let $\nu$ be a stochastic bandit environment. Then,


(a) $R_n(\pi, \nu) \ge 0$ for all policies $\pi$;
(b) the policy $\pi$ choosing $A_t \in \operatorname{argmax}_a \mu_a$ for all $t$ satisfies $R_n(\pi, \nu) = 0$; and
(c) if $R_n(\pi, \nu) = 0$ for some policy $\pi$, then $\mathbb P(\mu_{A_t} = \mu^*) = 1$ for all $t \in [n]$.


We leave the proof for the reader (Exercise 4.1). Part (b) of Lemma 4.4 shows
that for every bandit $\nu$, there exists a policy for which the regret is zero (the best
possible outcome). According to Part (c), achieving zero is possible if and only if
the learner knows which bandit it is facing (or at least which arm is optimal).
In general, however, the learner only knows that $\nu \in \mathcal E$ for some environment
class $\mathcal E$. So what can we hope for? A relatively weak objective is to find a policy
$\pi$ with sublinear regret on all $\nu \in \mathcal E$. Formally, this objective is to find a policy $\pi$
such that


    for all $\nu \in \mathcal E$,  $\lim_{n \to \infty} \frac{R_n(\pi, \nu)}{n} = 0$.


If the above holds, then at least the learner is choosing the optimal action almost
all of the time as the horizon tends to infinity. One might hope for much more,
however, for example, that for some specific choice of $C > 0$ and $p < 1$,


    for all $\nu \in \mathcal E$,  $R_n(\pi, \nu) \le C n^p$.    (4.2)


Yet another alternative is to find functions $C : \mathcal E \to [0, \infty)$ and $f : \mathbb N \to [0, \infty)$
such that


    for all $n \in \mathbb N$, $\nu \in \mathcal E$,  $R_n(\pi, \nu) \le C(\nu) f(n)$.    (4.3)


This factorisation of the regret into a function of the instance and a function 
of the horizon is not uncommon in learning theory and appears in particular in 
supervised learning. 

We will spend a lot of time in the following chapters finding policies satisfying
Eq. (4.2) and Eq. (4.3) for different choices of $\mathcal E$. The form of Eq. (4.3) is quite
general, so much time is also spent discovering what is possible for $f$ and
$C$, both of which should be ‘as small as possible’. All of the policies are inspired
by the simple observation that in order to make the regret small, the algorithm 
must discover the action/arm with the largest mean. Usually this means the 
algorithm should play each arm some number of times to form an estimate of 
the mean of that arm, and subsequently play the arm with the largest estimated 
mean. The question essentially boils down to discovering exactly how often the 
learner must play each arm in order to have reasonable statistical certainty that 
it has found the optimal arm. 
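
To fix ideas, the observation above can be turned into a very simple policy: play every arm a fixed number of times, then commit to the arm with the largest estimated mean. The sketch below is only an illustration of that idea and is not a policy defined in this chapter; the function name and the parameter `m` (exploration plays per arm, assumed to be at least 1) are our own choices.

```python
import random

def explore_then_commit(arms, n, m, rng):
    """Play each arm m times in round-robin order, then commit to the empirical best.

    arms: list of samplers rng -> reward; n: horizon; m >= 1: plays per arm during
    exploration (a tuning parameter assumed for this sketch).
    Returns the list of actions played.
    """
    k = len(arms)
    totals = [0.0] * k
    counts = [0] * k
    committed = None
    actions = []
    for t in range(n):
        if t < m * k:
            a = t % k                                   # exploration phase
        else:
            if committed is None:                       # choose once, after exploring
                committed = max(range(k), key=lambda i: totals[i] / counts[i])
            a = committed
        x = arms[a](rng)
        totals[a] += x
        counts[a] += 1
        actions.append(a)
    return actions

# Usage with two hypothetical Bernoulli arms.
bernoulli_arms = [lambda rng: float(rng.random() < 0.4),
                  lambda rng: float(rng.random() < 0.6)]
played = explore_then_commit(bernoulli_arms, n=100, m=10, rng=random.Random(1))
```

How large `m` should be is exactly the question raised above: too few exploration rounds risk committing to the wrong arm, while too many waste plays on suboptimal arms.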

There is another candidate objective called the Bayesian regret. If $Q$ is a
prior probability measure on $\mathcal E$ (which must be equipped with a $\sigma$-algebra $\mathcal F$),
then the Bayesian regret is the average of the regret with respect to the prior $Q$:


    $\mathrm{BR}_n(\pi, Q) = \int_{\mathcal E} R_n(\pi, \nu) \, dQ(\nu)$,    (4.4)

which is only defined by assuming (or proving) that the regret is a measurable
function with respect to $\mathcal F$. An advantage of the Bayesian approach is that
having settled on a prior and horizon, the problem of finding a policy that
minimises the Bayesian regret is just an optimisation problem. Most of this
book is devoted to analysing the ‘frequentist’ regret in Eq. (4.1), which does not
integrate over all environments as Eq. (4.4) does. Bayesian methods are covered 
in Chapters 34 to 36, where we also discuss the strengths and weaknesses of the 
Bayesian approach. 
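
When the prior is supported on finitely many bandits, the integral in Eq. (4.4) reduces to a weighted sum. The following sketch evaluates it for the trivial policy that always plays one fixed arm, whose regret is $n$ times the gap of that arm. The prior and the function names are hypothetical and serve only as an illustration.

```python
def regret_of_fixed_arm(means, arm, n):
    """Regret over n rounds of the policy that always plays `arm`: n times its gap."""
    return n * (max(means) - means[arm])

def bayesian_regret(prior, arm, n):
    """Average of the regret under a finitely supported prior.

    prior: list of (probability, means) pairs, one per bandit instance.
    """
    return sum(q * regret_of_fixed_arm(means, arm, n) for q, means in prior)

# Hypothetical prior over two two-armed instances, described by their mean vectors.
prior = [(0.5, (0.5, 0.25)), (0.5, (0.25, 0.75))]
br_arm0 = bayesian_regret(prior, arm=0, n=100)
br_arm1 = bayesian_regret(prior, arm=1, n=100)
```

Always playing the second arm has zero regret on the second instance but linear regret on the first; the Bayesian regret averages these two cases with the prior weights.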


Decomposing the Regret 


We now present a lemma that forms the basis of almost every proof for
stochastic bandits. Let $\nu = (P_a : a \in \mathcal A)$ be a stochastic bandit and define
$\Delta_a(\nu) = \mu^*(\nu) - \mu_a(\nu)$, which is called the suboptimality gap or action gap
or immediate regret of action $a$. Further, let


    $T_a(t) = \sum_{s=1}^t \mathbb I\{A_s = a\}$


be the number of times action $a$ was chosen by the learner after the end of round
$t$. In general, $T_a(n)$ is random, which may seem surprising if we think about a
deterministic policy that chooses the same action for any fixed history. So why
is $T_a(n)$ random in this case? The reason is that for all rounds $t$ except for
the first, the action $A_t$ depends on the rewards observed in rounds $1, 2, \ldots, t-1$,
which are random, hence $A_t$ will also inherit their randomness. We are now ready
to state the second and last lemma of the chapter. In the statement of the lemma,
we use our convention that the dependence of the various quantities involved on
the policy $\pi$ and the environment $\nu$ is suppressed.


LEMMA 4.5 (Regret decomposition lemma). For any policy $\pi$ and stochastic
bandit environment $\nu$ with $\mathcal A$ finite or countable and horizon $n \in \mathbb N$, the regret
$R_n$ of policy $\pi$ in $\nu$ satisfies


    $R_n = \sum_{a \in \mathcal A} \Delta_a \mathbb E[T_a(n)]$.    (4.5)


The lemma decomposes the regret in terms of the loss due to using each of the
arms. It is useful because it tells us that to keep the regret small, the learner
should try to minimise the weighted sum of expected action counts, where the
weights are the respective suboptimality gaps, $(\Delta_a)_{a \in \mathcal A}$.


Lemma 4.5 tells us that a learner should aim to use an arm with a larger 
suboptimality gap proportionally fewer times. 


Note that the suboptimality gap for optimal arm(s) is zero.


Proof of Lemma 4.5 Since $R_n$ is based on summing over rounds, and the right-hand
side of the lemma statement is based on summing over actions, to convert one
sum into the other one, we introduce indicators. In particular, note that for any
fixed $t$ we have $\sum_{a \in \mathcal A} \mathbb I\{A_t = a\} = 1$. Hence $S_n = \sum_t X_t = \sum_t \sum_a X_t \mathbb I\{A_t = a\}$,
and thus


    $R_n = n\mu^* - \mathbb E[S_n] = \sum_{a \in \mathcal A} \sum_{t=1}^n \mathbb E[(\mu^* - X_t) \mathbb I\{A_t = a\}]$.    (4.6)


The expected reward in round $t$ conditioned on $A_t$ is $\mu_{A_t}$, which means that


    $\mathbb E[(\mu^* - X_t) \mathbb I\{A_t = a\} \mid A_t] = \mathbb I\{A_t = a\} \, \mathbb E[\mu^* - X_t \mid A_t]$
        $= \mathbb I\{A_t = a\} (\mu^* - \mu_{A_t})$
        $= \mathbb I\{A_t = a\} (\mu^* - \mu_a)$
        $= \mathbb I\{A_t = a\} \Delta_a$.


The result is completed by plugging this into Eq. (4.6) and using the definition
of $T_a(n)$.
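
The identity in Lemma 4.5 can also be checked numerically. The sketch below estimates both sides by Monte Carlo for Bernoulli arms and a simple history-dependent policy. Everything here is an assumption made for illustration: the policy `follow_the_leader`, the function names, the means and the number of runs are not taken from the text, and the two estimates agree only up to simulation noise.

```python
import random

def follow_the_leader(history, k):
    """Hypothetical policy: try each arm once, then play the best empirical mean."""
    counts = [0] * k
    totals = [0.0] * k
    for a, x in history:
        counts[a] += 1
        totals[a] += x
    for a in range(k):
        if counts[a] == 0:
            return a
    return max(range(k), key=lambda a: totals[a] / counts[a])

def estimate_both_sides(means, n, runs, seed=0):
    """Monte Carlo estimates of n*mu^* - E[S_n] and of sum_a Delta_a * E[T_a(n)]."""
    rng = random.Random(seed)
    k = len(means)
    mu_star = max(means)
    gaps = [mu_star - mu for mu in means]
    total_reward = 0.0
    total_counts = [0] * k
    for run in range(runs):
        history = []
        for t in range(n):
            a = follow_the_leader(history, k)
            x = 1.0 if rng.random() < means[a] else 0.0   # Bernoulli reward from P_a
            history.append((a, x))
            total_reward += x
            total_counts[a] += 1
    lhs = n * mu_star - total_reward / runs
    rhs = sum(gaps[a] * total_counts[a] / runs for a in range(k))
    return lhs, rhs
```

For example, `estimate_both_sides([0.5, 0.7], n=30, runs=1000)` returns two numbers that should be close. The right-hand estimate is less noisy because it depends on the rewards only through the action counts.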


The argument fails when $\mathcal A$ is uncountable because you cannot introduce the
sum over actions. Of course the solution is to use an integral, but for this we need
to assume $(\mathcal A, \mathcal G)$ is a measurable space. Given a bandit $\nu$ and policy $\pi$ define
measure $G$ on $(\mathcal A, \mathcal G)$ by


    $G(U) = \mathbb E\left[\sum_{t=1}^n \mathbb I\{A_t \in U\}\right]$,


where the expectation is taken with respect to the measure on outcomes induced
by the interaction of $\pi$ and $\nu$.


LEMMA 4.6. Provided that everything is well defined and appropriately measurable,


    $R_n = \mathbb E\left[\sum_{t=1}^n \Delta_{A_t}\right] = \int_{\mathcal A} \Delta_a \, dG(a)$.


For those worried about how to ensure everything is well defined, see Section 4.7.


The Canonical Bandit Model (®) 


In most cases the underlying probability space that supports the random rewards 
and actions is never mentioned. Occasionally, however, it becomes convenient to 
choose a specific probability space, which we call the canonical bandit model.


Finite Horizon

Let $n \in \mathbb N$ be the horizon. A policy and bandit interact to produce the outcome,
which is the tuple of random variables $H_n = (A_1, X_1, \ldots, A_n, X_n)$. The first step
towards constructing a probability space that carries these random variables
is to choose the measurable space. For each $t \in [n]$, let $\Omega_t = ([k] \times \mathbb R)^t \subset \mathbb R^{2t}$
and $\mathcal F_t = \mathfrak B(\Omega_t)$. The random variables $A_1, X_1, \ldots, A_n, X_n$ that make up the
outcome are defined by their coordinate projections:


    $A_t(a_1, x_1, \ldots, a_n, x_n) = a_t$ and $X_t(a_1, x_1, \ldots, a_n, x_n) = x_t$.


The probability measure on $(\Omega_n, \mathcal F_n)$ depends on both the environment and the
policy. Our informal definition of a policy is not quite sufficient now.


DEFINITION 4.7. A policy $\pi$ is a sequence $(\pi_t)_{t=1}^n$, where $\pi_t$ is a probability
kernel from $(\Omega_{t-1}, \mathcal F_{t-1})$ to $([k], 2^{[k]})$. Since $[k]$ is discrete, we adopt the notational
convention that for $i \in [k]$,


    $\pi_t(i \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1}) = \pi_t(\{i\} \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})$.


Let $\nu = (P_i)_{i=1}^k$ be a stochastic bandit where each $P_i$ is a probability measure
on $(\mathbb R, \mathfrak B(\mathbb R))$. We want to define a probability measure on $(\Omega_n, \mathcal F_n)$ that respects
our understanding of the sequential nature of the interaction between the learner
and a stationary stochastic bandit. Since we only care about the law of the
random variables $(X_t)$ and $(A_t)$, the easiest way to enforce this is to directly list
our expectations, which are


(a) the conditional distribution of action $A_t$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}$ is
$\pi_t(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$ almost surely.

(b) the conditional distribution of reward $X_t$ given $A_1, X_1, \ldots, A_t$ is $P_{A_t}$ almost
surely.


The sufficiency of these assumptions is asserted by the following proposition, 
which we ask you to prove in Exercise 4.2. 


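PROPOSITION 4.8. Suppose that $\mathbb P$ and $\mathbb Q$ are probability measures on an arbitrary
measurable space $(\Omega, \mathcal F)$ and $A_1, X_1, \ldots, A_n, X_n$ are random variables on $\Omega$, where
$A_t \in [k]$ and $X_t \in \mathbb R$. If both $\mathbb P$ and $\mathbb Q$ satisfy (a) and (b), then the law of the
outcome $(A_1, X_1, \ldots, A_n, X_n)$ under $\mathbb P$ is the same as under $\mathbb Q$:


    $\mathbb P_{A_1, X_1, \ldots, A_n, X_n} = \mathbb Q_{A_1, X_1, \ldots, A_n, X_n}$.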


Next we construct a probability measure on $(\Omega_n, \mathcal F_n)$ that satisfies (a) and
(b). To emphasise that what follows is intuitively not complicated, imagine that
$X_t \in \{0, 1\}$ is Bernoulli, which means the set of possible outcomes is finite and
we can define the measure in terms of a distribution. Let $p_i(0) = P_i(\{0\})$ and
$p_i(1) = 1 - p_i(0)$ and define


    $p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^n \pi_t(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1}) \, p_{a_t}(x_t)$.

The reader can check that $p_{\nu\pi}$ is a distribution on $([k] \times \{0, 1\})^n$ and that the
associated measure satisfies (a) and (b) above. Making this argument rigorous
when $(P_i)$ are not discrete requires the use of Radon-Nikodym derivatives. Let $\lambda$
be a $\sigma$-finite measure on $(\mathbb R, \mathfrak B(\mathbb R))$ for which $P_i$ is absolutely continuous with
respect to $\lambda$ for all $i$. Next, let $p_i = dP_i/d\lambda$ be the Radon-Nikodym derivative of
$P_i$ with respect to $\lambda$, which is a function $p_i : \mathbb R \to \mathbb R$ such that $\int_B p_i \, d\lambda = P_i(B)$
for all $B \in \mathfrak B(\mathbb R)$. Letting $\rho$ be the counting measure with $\rho(B) = |B|$, the density
$p_{\nu\pi} : \Omega_n \to \mathbb R$ can now be defined with respect to the product measure $(\rho \times \lambda)^n$
by


    $p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^n \pi_t(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1}) \, p_{a_t}(x_t)$.    (4.7)


The reader can again check (more abstractly) that (a) and (b) are satisfied by
the probability measure $\mathbb P_{\nu\pi}$ defined by


    $\mathbb P_{\nu\pi}(B) = \int_B p_{\nu\pi}(\omega) \, (\rho \times \lambda)^n(d\omega)$ for all $B \in \mathcal F_n$.


It is important to emphasise that this choice of $(\Omega_n, \mathcal F_n, \mathbb P_{\nu\pi})$ is not unique. Instead,
all that this shows is that a suitable probability space does exist. Furthermore, if
some quantity of interest depends on the law of $H_n$, by Proposition 4.8, there is
no loss in generality in choosing $(\Omega_n, \mathcal F_n, \mathbb P_{\nu\pi})$ as the probability space.
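
For Bernoulli rewards and a short horizon, the product formula for the distribution of outcomes can be evaluated by brute force, which gives a quick sanity check that it sums to one. The sketch below is an illustration under our own assumptions: arms are indexed from 0, and the policy `stay_on_win` and the values in `p_zero` are hypothetical.

```python
from itertools import product

def outcome_probability(outcome, policy, p_zero):
    """Product over rounds of pi_t(a_t | history) * p_{a_t}(x_t) for Bernoulli rewards.

    outcome: sequence of (action, reward) pairs; p_zero[a] is P_a({0}).
    """
    prob = 1.0
    history = []
    for a, x in outcome:
        reward_prob = p_zero[a] if x == 0 else 1.0 - p_zero[a]
        prob *= policy(a, history) * reward_prob
        history.append((a, x))
    return prob

def stay_on_win(a, history):
    """Hypothetical two-armed policy pi_t(a | history): uniform in the first round,
    afterwards repeat the previous arm after reward 1 and switch after reward 0."""
    if not history:
        return 0.5
    prev_a, prev_x = history[-1]
    chosen = prev_a if prev_x == 1 else 1 - prev_a
    return 1.0 if a == chosen else 0.0

def all_outcomes(k, n):
    """Every element of ([k] x {0, 1})^n, with arms indexed from 0."""
    return product(product(range(k), (0, 1)), repeat=n)

p_zero = [0.5, 0.25]
total = sum(outcome_probability(o, stay_on_win, p_zero) for o in all_outcomes(2, 3))
```

Outcomes that the policy can never produce, such as switching arms after a reward of 1, receive probability zero, and the remaining outcomes carry all of the mass.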


A choice of $\lambda$ such that $P_i \ll \lambda$ for all $i$ always exists since $\lambda = \sum_{i=1}^k P_i$
satisfies this condition. For direct calculations, another choice is usually
more convenient, e.g. the counting measure when $(P_i)$ are discrete and the
Lebesgue measure for continuous $(P_i)$.


There is another way to define the probability space, which can be useful.
Define a collection of independent random variables $(X_{s,i})_{s \in [n], i \in [k]}$ such that the
law of $X_{s,i}$ is $P_i$. By Theorem 2.4 these random variables may be defined on
$(\Omega, \mathcal F)$, where $\Omega = \mathbb R^{nk}$ and $\mathcal F = \mathfrak B(\mathbb R^{nk})$. Then let $X_t = X_{t, A_t}$, where the actions
$A_t$ are $\mathcal F_{t-1}$-measurable with $\mathcal F_{t-1} = \sigma(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$. We call this the
random table model. Yet another way is to define $(X_{s,i})_{s,i}$ as above but let
$X_t = X_{T_{A_t}(t), A_t}$. This corresponds to sampling a stack of rewards for each arm
at the beginning of the game, giving rise to the reward-stack model. Each time
the learner chooses an action, they receive the reward on top of the stack. All of
these models are convenient from time to time. The important thing is that it
does not matter which model we choose because the quantity of ultimate interest
(usually the regret) only depends on the law of $A_1, X_1, \ldots, A_n, X_n$, and this is
the same for all choices (Exercise 4.4).

Infinite Horizon

We never need the canonical bandit model for the case that $n = \infty$. It is comforting
to know, however, that there does exist a probability space $(\Omega, \mathcal F, \mathbb P_{\nu\pi})$ and infinite
sequences of random variables $X_1, X_2, \ldots$ and $A_1, A_2, \ldots$ satisfying (a) and (b).
The result follows directly from the theorem of Ionescu-Tulcea (Theorem 3.3).
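
Returning to the finite-horizon constructions, the difference between the random table model and the reward-stack model is easiest to see in code. The sketch below is illustrative only: arms and rounds are indexed from 0, the policy is taken to be a deterministic function of the history, and all names are our own.

```python
import random

def sample_table(means, n, rng):
    """Independent Bernoulli rewards table[s][i] with law P_i, for each round s and arm i."""
    return [[1.0 if rng.random() < mu else 0.0 for mu in means] for _ in range(n)]

def play_random_table(table, policy):
    """Random table model: in round t the learner receives table[t][A_t]."""
    history = []
    for t in range(len(table)):
        a = policy(history)
        history.append((a, table[t][a]))
    return history

def play_reward_stack(table, policy):
    """Reward-stack model: the learner receives the next unused entry of the
    column of the chosen arm, i.e. the reward on top of that arm's stack."""
    history = []
    pulls = [0] * len(table[0])
    for t in range(len(table)):
        a = policy(history)
        history.append((a, table[pulls[a]][a]))
        pulls[a] += 1
    return history

def alternate(history):
    """Hypothetical deterministic policy that alternates between arms 0 and 1."""
    return len(history) % 2

table = sample_table([0.4, 0.6], n=6, rng=random.Random(0))
outcome_table = play_random_table(table, alternate)
outcome_stack = play_reward_stack(table, alternate)
```

For a given table the two models generally produce different outcomes, because they read different entries. Only the law of the outcome is the same in both models, which is all that matters for the regret.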


The Canonical Bandit Model for Uncountable Action Sets (Œ) 


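For uncountable action sets, a little more machinery is necessary to make things
rigorous. The first requirement is that the action set must be a measurable
space $(\mathcal A, \mathcal G)$ and the collection of distributions $\nu = (P_a : a \in \mathcal A)$ that defines a
bandit environment must be a probability kernel from $(\mathcal A, \mathcal G)$ to $(\mathbb R, \mathfrak B(\mathbb R))$. A
policy is a sequence $(\pi_t)_{t=1}^n$, where $\pi_t$ is a probability kernel from $(\Omega_{t-1}, \mathcal F_{t-1})$
to $(\mathcal A, \mathcal G)$ with

    $\Omega_t = \prod_{s=1}^t (\mathcal A \times \mathbb R)$ and $\mathcal F_t = \bigotimes_{s=1}^t (\mathcal G \otimes \mathfrak B(\mathbb R))$.

The canonical bandit model is the probability measure $\mathbb P_{\nu\pi}$ on $(\Omega_n, \mathcal F_n)$
obtained by taking the product of the probability kernels $\pi_1, P'_1, \ldots, \pi_n, P'_n$ and
using Ionescu-Tulcea (Theorem 3.3), where $P'_t$ is the probability kernel from
$(\Omega_{t-1} \times \mathcal A, \mathcal F_{t-1} \otimes \mathcal G)$ to $(\mathbb R, \mathfrak B(\mathbb R))$ given by $P'_t(\cdot \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1}, a_t) = P_{a_t}(\cdot)$.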


We did not define $\mathbb P_{\nu\pi}$ in terms of a density because there may not exist a
common dominating measure for either $(P_a : a \in \mathcal A)$ or the policy. When
such measures exist, as they usually do, then $\mathbb P_{\nu\pi}$ may be defined in terms
of a density in the same manner as the previous section.


You will check in Exercise 4.6 that the assumptions on $\nu$ and $\pi$ in this section
are sufficient to ensure the quantities in Lemma 4.6 are well defined and that
Proposition 4.8 continues to hold in this setting without modification. Finally, in
none of the definitions above do we require that $n$ be finite.


Notes 


1 It is not obvious why the expected value is a good summary of the reward
distribution. Decision makers who base their decisions on expected values are
called risk-neutral. In the example shown in Fig. 4.1, a risk-averse
decision maker may actually prefer the distribution labelled as A because
occasionally distribution B may incur a very small (even negative) reward.
Risk-seeking decision makers, if they exist at all, would prefer distributions
with occasional large rewards to distributions that give mediocre rewards only.
There is a formal theory of what makes a decision maker rational (a decision
maker in a nutshell is rational if they do not contradict themselves). Rational
decision makers compare stochastic alternatives based on the alternatives’
expected utilities, according to the von Neumann-Morgenstern utility theorem.
Humans are known not to do this. We are irrational. No surprise here.

The study of utility and risk has a long history, going right back to (at least) 
the beginning of probability [Bernoulli, 1954, translated from the original 
Latin, 1738]. The research can broadly be categorised into two branches. The 
first deals with describing how people actually make choices (descriptive 
theories), while the second is devoted to characterising how a rational decision 
maker should make decisions (prescriptive theories). A notable example 
of the former type is ‘prospect theory’ [Kahneman and Tversky, 1979], which 
models how people handle probabilities (especially small ones) and earned 
Daniel Kahneman a Nobel Prize (after the death of his long-time collaborator, 
Amos Tversky). Further descriptive theories concerned with alternative aspects 
of human decision-making include bounded rationality, choice strategies, 
recognition-primed decision-making and image theory [Adelman, 2013]. 

The most famous example of a prescriptive theory is the von Neumann- 
Morgenstern expected utility theorem, which states that under (reasonable) 
axioms of rational behaviour under uncertainty, a rational decision maker 
must choose amongst alternatives by computing the expected utility of the 
outcomes [Neumann and Morgenstern, 1944]. Thus, rational decision makers, 
under the chosen axioms, differ only in terms of how they assign utility to 
outcomes (i.e. rewards). Finance is another field where attitudes towards 
uncertainty and risk are important. Markowitz [1952] argues against expected 
return as a reasonable metric that investors would use. His argument is based 
on the (simple) observation that portfolios maximising expected returns will 
tend to have a single stock only (unless there are multiple stocks with equal 
expected returns, a rather unlikely outcome). He argues that such a complete 
lack of diversification is unreasonable. He then proposes that investors should 
minimise the variance of the portfolio’s return subject to a constraint on the 
portfolio’s expected return, leading to the so-called mean-variance optimal 
portfolio choice theory. Under this criterion, portfolios will indeed tend to 
be diversified (and in a meaningful way: correlations between returns are taken 
into account). This theory eventually won him a Nobel Prize in economics 
(shared with two others). Closely related to the mean-variance criterion are the 
‘value-at-risk’ (VaR) and the ‘conditional value-at-risk’, the latter of which has 
been introduced and promoted by Rockafellar and Uryasev [2000] due to its 
superior optimisation properties. The distinction between the prescriptive and 
descriptive theories is important: human decision makers are in many ways 
violating rules of rationality in their attitudes towards risk. 

We defined the regret as an expectation, which makes it unusable in conjunction 
with measures of risk because the randomness has been eliminated by the 
expectation. When using a risk measure in a bandit setting, we can either base 
this on the random regret or pseudo-regret defined by 

$$\hat R_n = n\mu^* - \sum_{t=1}^n X_t, \qquad \text{(random regret)}$$

$$\bar R_n = n\mu^* - \sum_{t=1}^n \mu_{A_t}. \qquad \text{(pseudo-regret)}$$

While $\hat R_n$ is influenced by the noise $X_t - \mu_{A_t}$ in the rewards, the pseudo-regret 
filters this out, which arguably makes it a better basis for measuring the ‘skill’ 
of a bandit policy. As these random regret measures tend to be highly skewed, 
using variance to assess risk suffers not only from the problem of penalising 
upside risk, but also from failing to capture the skew of the distribution. 
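
To make the difference between the two quantities concrete, here is a minimal sketch that computes both on the same simulated run. The function name `simulate_regrets`, the uniformly random policy and the choice of means are assumptions made purely for illustration; none of them comes from the text.

```python
import random

def simulate_regrets(means, n, seed=0):
    """Play a uniformly random policy for n rounds on a Bernoulli bandit.

    Returns (random_regret, pseudo_regret) for this single run.
    """
    rng = random.Random(seed)
    mu_star = max(means)
    reward_sum = 0.0  # running sum of the observed rewards X_t
    mean_sum = 0.0    # running sum of the means mu_{A_t} of the played arms
    for _ in range(n):
        a = rng.randrange(len(means))                # action A_t
        x = 1.0 if rng.random() < means[a] else 0.0  # Bernoulli reward X_t
        reward_sum += x
        mean_sum += means[a]
    return n * mu_star - reward_sum, n * mu_star - mean_sum

random_regret, pseudo_regret = simulate_regrets([0.5, 0.6], n=100)
print(random_regret, pseudo_regret)
```

The pseudo-regret is always non-negative, whereas the random regret can be negative on a lucky run because it still contains the reward noise.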

5 What happens if the distributions of the arms are changing with time? 
Such bandits are unimaginatively called non-stationary bandits. With no 
assumptions, there is not much to be done. Because of this, it is usual to 
assume the distributions change infrequently or drift slowly. We’ll eventually 
see that techniques for stationary bandits can be adapted to this set-up (see 
Chapter 31). 

6 The rigorous models introduced in Sections 4.6 and 4.7 are easily extended to 
more sophisticated settings. For example, the environment sometimes produces 
side information as well as rewards or the set of available actions may change 
with time. You are asked to formalise an example in Exercise 4.7. 


Bibliographical Remarks 


There is now a huge literature on stochastic bandits, much of which we will 
discuss in detail in the chapters that follow. The earliest reference that we know 
of is by Thompson [1933], who proposed an algorithm that forms the basis 
of many of the currently practical approaches in use today. Thompson was a 
pathologist who published broadly and apparently did not pursue bandits much 
further. Sadly his approach was not widely circulated, and the algorithm (now 
called Thompson sampling) did not become popular until very recently. Two 
decades after Thompson, the bandit problem was formally restated in a short but 
influential paper by Robbins [1952], an American statistician now most famous 
for his work on empirical Bayes. Robbins introduced the notion of regret and 
minimax regret in his 1952 paper. The regret decomposition (Lemma 4.5) has 
been used in practically every work on stochastic bandits, and its origin is hard 
to pinpoint. All we can say for sure is that it does not appear in the paper by 
Robbins [1952], but does appear in the work of Lai and Robbins [1985]. Denardo 
et al. [2007] consider risk in a (complicated) Bayesian setting. Sani et al. [2012] 
consider a mean-variance approach to risk, while Maillard [2013] considers so-called 
coherent risk measures (CVaR is one example of such a risk measure), 
with an approach where the regret itself is redefined. VaR is considered in 
the context of a specific bandit policy family by Audibert et al. [2007, 2009]. 


Exercises 


4.1 (POSITIVITY OF THE REGRET) Prove Lemma 4.4. 
4.2 (UNIQUENESS OF LAW) Prove Proposition 4.8. 


4.3 (DEFINITION OF CANONICAL PROBABILITY MEASURE) Prove that the measure 
defined in terms of the density in Eq. (4.7) satisfies the conditions (a) and (b) 
in Section 4.6. 


HINT Use the properties of the Radon—Nikodym derivative in combination with 
Fubini’s theorem. 


4.4 (RANDOM TABLE VERSUS STACKED-REWARD MODELS) Show that both the 
random-table and the stacked-reward models give rise to probability distributions 
that satisfy condition (b) in Section 4.6. 

HINT When reasoning about the stacked-reward model, use Doob’s optional 
stopping theorem (Theorem 3.8), which continues to hold if $\mathbb{N}$ in the theorem 
is replaced by $[t]$ for some $t \in \mathbb{N}^+$, assuming that $\tau \le t$ (which also means that 
Condition (a) of Doob’s theorem is automatically satisfied). 


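4.5 (MIXING POLICIES) Fix a horizon $n$ and $k$. Let $\Pi$ be a finite set of policies 
for $k$-armed bandits on horizon $n$ and $p \in \mathcal{P}(\Pi)$ be a distribution over $\Pi$. Show 
there exists a policy $\pi^\circ$ such that for any $k$-armed stochastic bandit $\nu$, 

$$\mathbb{P}_{\nu\pi^\circ} = \sum_{\pi \in \Pi} p(\pi)\, \mathbb{P}_{\nu\pi}.$$

Proof For action/reward sequence $a_1, x_1, \ldots, a_n, x_n$, syntactically abbreviate 
$h_t = a_1, x_1, \ldots, a_t, x_t$. Then define 

$$\pi^\circ_t(a_t \mid h_{t-1}) = \frac{\sum_{\pi \in \Pi} p(\pi) \prod_{s=1}^{t} \pi_s(a_s \mid h_{s-1})}{\sum_{\pi \in \Pi} p(\pi) \prod_{s=1}^{t-1} \pi_s(a_s \mid h_{s-1})}.$$

By the definition of the canonical probability space and the product of probability 
kernels, 

$$\begin{aligned}
\mathbb{P}_{\nu\pi^\circ}(B)
&= \sum_{a_1=1}^k \int \cdots \sum_{a_n=1}^k \int \mathbb{I}_B(h_n)\, \nu_{a_n}(dx_n)\, \pi^\circ_n(a_n \mid h_{n-1}) \cdots \nu_{a_1}(dx_1)\, \pi^\circ_1(a_1) \\
&= \sum_{\pi \in \Pi} p(\pi) \sum_{a_1=1}^k \int \cdots \sum_{a_n=1}^k \int \mathbb{I}_B(h_n)\, \nu_{a_n}(dx_n)\, \pi_n(a_n \mid h_{n-1}) \cdots \nu_{a_1}(dx_1)\, \pi_1(a_1) \\
&= \sum_{\pi \in \Pi} p(\pi)\, \mathbb{P}_{\nu\pi}(B),
\end{aligned}$$

where the second equality follows by substituting the definition of $\pi^\circ$ and 
induction. $\square$ 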


4.6 (REGRET DECOMPOSITION AND CANONICAL MODEL FOR LARGE ACTION 
SPACES) Let $\nu$ be a bandit on measurable action space $(\mathcal{A}, \mathcal{G})$ and $\pi_1, \ldots, \pi_n$ be 
a policy satisfying the conditions in Section 4.7. 


(a) Show that all quantities in Lemma 4.6 are appropriately defined and 
measurable. 

(b) Prove Lemma 4.6. 

(c) Prove that Proposition 4.8 continues to hold. 


4.7 (CANONICAL MODEL FOR CONTEXTUAL BANDIT) Let $\mathcal{A}$ and $\mathcal{C}$ be finite sets. 
A stochastic contextual bandit is like a normal stochastic bandit, but in each 
round the learner first observes a context $C_t \in \mathcal{C}$. They then choose an action 
$A_t \in \mathcal{A}$ and receive a reward $X_t \sim P_{A_t, C_t}$. 

(a) Suppose that $C_1, \ldots, C_n$ is sampled independently from distribution 
$\xi$ on $\mathcal{C}$. Construct the canonical probability space that carries 
$C_1, A_1, X_1, \ldots, C_n, A_n, X_n$. 

(b) What changes when $C_t$ is allowed to depend on $C_1, A_1, X_1, \ldots, C_{t-1}, A_{t-1}, X_{t-1}$? 


4.8 (BERNOULLI ENVIRONMENT IMPLEMENTATION) Implement a Bernoulli bandit 
environment in Python using the code snippet below (or adapt to your favourite 
language). 


class BernoulliBandit:
    # Accepts a list of K >= 2 floats, each lying in [0, 1]
    def __init__(self, means):
        pass

    # Function should return the number of arms
    def K(self):
        pass

    # Accepts a parameter 0 <= a <= K-1 and returns the
    # realisation of random variable X with P(X = 1) being
    # the mean of the (a+1)th arm.
    def pull(self, a):
        pass

    # Returns the regret incurred so far.
    def regret(self):
        pass


4.9 (FOLLOW-THE-LEADER IMPLEMENTATION) Implement the following simple 
algorithm called ‘follow-the-leader’, which chooses each action once and 
subsequently chooses the action with the largest average observed so far. Ties 
should be broken randomly. 


def FollowTheLeader(bandit, n):
    # implement the Follow-the-Leader algorithm by replacing
    # the code below that just plays the first arm in every round
    for t in range(n):
        bandit.pull(0)


Depending on the literature you are reading, follow-the-leader may be called 
‘stay with the winner’ or the ‘greedy algorithm’. 


4.10 Suppose $\nu$ is a finite-armed stochastic bandit and $\pi$ is a policy such that 

$$\lim_{n\to\infty} \frac{R_n(\pi, \nu)}{n} = 0.$$

Let $T^*(n) = \sum_{t=1}^n \mathbb{I}\{\mu_{A_t} = \mu^*\}$ be the number of times an optimal arm is chosen. 

Prove or disprove each of the following statements: 

(a) $\lim_{n\to\infty} \mathbb{E}[T^*(n)]/n = 1$. 
(b) $\lim_{n\to\infty} \mathbb{P}(\Delta_{A_n} > 0) = 0$. 


4.11 (ONE-ARMED BANDITS) Let $\mathcal{M}_1$ be a set of distributions on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ with 
finite means and $\mathcal{M}_2 = \{\delta_{\mu_2}\}$ be the singleton set with a Dirac at $\mu_2 \in \mathbb{R}$. The 
set of bandits $\mathcal{E} = \mathcal{M}_1 \times \mathcal{M}_2$ is called a one-armed bandit because, although 
there are two arms, the second arm always yields a known reward of $\mu_2$. A policy 
$\pi = (\pi_t)_t$ is called a retirement policy if once action 2 has been played once, 
it is played until the end of the game. Precisely, if $a_t = 2$, then 

$$\pi_{t+1}(2 \mid a_1, x_1, \ldots, a_t, x_t) = 1 \quad \text{for all } (a_s)_{s=1}^{t-1} \text{ and } (x_s)_{s=1}^{t}.$$

(a) Let $n$ be fixed and $\pi = (\pi_t)_{t=1}^n$ be any policy. Prove there exists a retirement 
policy $\pi' = (\pi'_t)_{t=1}^n$ such that for all $\nu \in \mathcal{E}$, 

$$R_n(\pi', \nu) \le R_n(\pi, \nu).$$

(b) Let $\mathcal{M}_1 = \{\mathcal{B}(\mu_1) : \mu_1 \in [0,1]\}$ and suppose that $\pi = (\pi_t)_{t=1}^\infty$ is a retirement 
policy. Prove there exists a bandit $\nu \in \mathcal{E}$ such that 

$$\limsup_{n\to\infty} \frac{R_n(\pi, \nu)}{n} > 0.$$

4.12 (FAILURE OF FOLLOW-THE-LEADER (I)) Consider a Bernoulli bandit with 
two arms and means $\mu_1 = 0.5$ and $\mu_2 = 0.6$. 

(a) Using a horizon of $n = 100$, run 1000 simulations of your implementation 
of follow-the-leader on the Bernoulli bandit above and record the (random) 
pseudo-regret, $n\mu^* - \sum_{t=1}^n \mu_{A_t}$, in each simulation. 

(b) Plot the results using a histogram. Your figure should resemble Fig. 4.2. 

(c) Explain the results in the figure. 

Figure 4.2 Histogram of regret for follow-the-leader over 1000 trials on a Bernoulli bandit 
with means $\mu_1 = 0.5$, $\mu_2 = 0.6$. 


4.13 (FAILURE OF FOLLOW-THE-LEADER (II)) Consider the same Bernoulli 
bandit as used in the previous question. 


(a) Run 1000 simulations of your implementation of follow-the-leader for each 
horizon $n \in \{100, 200, 300, \ldots, 1000\}$. 

(b) Plot the average regret obtained as a function of n (see Fig. 4.3). Because the 
average regret is an estimator of the expected regret, you should generally 
include error bars to indicate the uncertainty in the estimation. 

(c) Explain the plot. Do you think follow-the-leader is a good algorithm? 
Why/why not? 

Figure 4.3 The regret for follow-the-leader over 1000 trials on a Bernoulli bandit with 
means $\mu_1 = 0.5$, $\mu_2 = 0.6$ and horizons ranging from $n = 100$ to $n = 1000$. 


5.1 


Concentration of Measure 


Before we can start designing and analysing algorithms, we need one more tool 
from probability theory, called concentration of measure. Recall that the 
optimal action is the one with the largest mean. Since the mean pay-offs are 
initially unknown, they must be learned from data. How long does it take to 
learn about the mean reward of an action? In this section, after introducing 
the notion of tail probabilities, we look at ways of obtaining upper bounds on 
them. The main point is to introduce subgaussian random variables and the 
Cramér—Chernoff exponential tail inequalities, which will play a central role in 
the design and analysis of the various bandit algorithms. 


Tail Probabilities 


Suppose that $X, X_1, X_2, \ldots, X_n$ is a sequence of independent and identically 
distributed random variables, and assume that the mean $\mu = \mathbb{E}[X]$ and variance 
$\sigma^2 = \mathbb{V}[X]$ exist. Having observed $X_1, X_2, \ldots, X_n$, we would like to estimate the 
common mean $\mu$. The most natural estimator is 

$$\hat\mu = \frac{1}{n} \sum_{i=1}^n X_i,$$

which is called the sample mean or empirical mean. Linearity of expectation 
(Proposition 2.6) shows that $\mathbb{E}[\hat\mu] = \mu$, which means that $\hat\mu$ is an unbiased 
estimator of $\mu$. How far from $\mu$ do we expect $\hat\mu$ to be? A simple measure 
of the spread of the distribution of a random variable $Z$ is its variance, 
$\mathbb{V}[Z] = \mathbb{E}[(Z - \mathbb{E}[Z])^2]$. A quick calculation using independence shows that 

$$\mathbb{V}[\hat\mu] = \mathbb{E}[(\hat\mu - \mu)^2] = \frac{\sigma^2}{n}, \qquad (5.1)$$

which means that we expect the squared distance between $\mu$ and $\hat\mu$ to shrink as 
$n$ grows large at a rate of $1/n$ and scale linearly with the variance of $X$. While 
the expected squared error is important, it does not tell us very much about the 
distribution of the error. To do this we usually analyse the probability that $\hat\mu$ 
overestimates or underestimates $\mu$ by more than some value $\varepsilon > 0$. Precisely, how 
do the following quantities depend on $\varepsilon$? 

$$\mathbb{P}(\hat\mu \ge \mu + \varepsilon) \quad \text{and} \quad \mathbb{P}(\hat\mu \le \mu - \varepsilon).$$

The expressions above (as a function of $\varepsilon$) are called the tail probabilities of 
$\hat\mu - \mu$ (Fig. 5.1). Specifically, the first is called the upper tail probability and the 
second the lower tail probability. Analogously, $\mathbb{P}(|\hat\mu - \mu| \ge \varepsilon)$ is called a two-sided 
tail probability. 

Figure 5.1 The figure shows a probability density, with the tails shaded indicating the 
regions where $X$ is at least $\varepsilon$ away from the mean $\mu$. 


The Inequalities of Markov and Chebyshev 


The most straightforward way to bound the tails is by using Chebyshev’s 
inequality, which is itself a corollary of Markov’s inequality. The latter is 
one of the golden hammers of probability theory, and so we include it for the 
sake of completeness. 


LEMMA 5.1. For any random variable $X$ and $\varepsilon > 0$, the following holds: 

(a) (Markov): $\mathbb{P}(|X| \ge \varepsilon) \le \dfrac{\mathbb{E}[|X|]}{\varepsilon}$. 
(b) (Chebyshev): $\mathbb{P}(|X - \mathbb{E}[X]| \ge \varepsilon) \le \dfrac{\mathbb{V}[X]}{\varepsilon^2}$. 

We leave the proof of Lemma 5.1 as an exercise for the reader. By combining 
(5.1) with Chebyshev’s inequality, we can bound the two-sided tail directly in 
terms of the variance by 

$$\mathbb{P}(|\hat\mu - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}. \qquad (5.2)$$

This result is nice because it was so easily bought and relied on no assumptions 
other than the existence of the mean and variance. The downside is that when $X$ is 
well behaved, the inequality is rather loose. By assuming that higher moments of 
$X$ exist, Chebyshev’s inequality can be improved by applying Markov’s inequality 
to $|\hat\mu - \mu|^k$, with the positive integer $k$ to be chosen so that the resulting bound is 
optimised. This is a bit cumbersome, and thus instead we present the continuous 
analog of this, known as the Cramér-Chernoff method. 
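
To get a feeling for how loose Chebyshev’s inequality can be, the following sketch compares the right-hand side of (5.2) with the exact two-sided tail probability of the empirical mean of fair coin flips. The function names and the choice $n = 100$, $\varepsilon = 0.1$ are illustrative assumptions, not part of the text.

```python
from math import comb

def exact_two_sided_tail(n, p, eps):
    """Exact P(|mu_hat - p| >= eps) for the mean of n Bernoulli(p) samples."""
    total = 0.0
    for k in range(n + 1):
        # The small tolerance guards against floating-point rounding in k / n - p.
        if abs(k / n - p) >= eps - 1e-12:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

def chebyshev_bound(n, variance, eps):
    """Right-hand side of (5.2), capped at one since it bounds a probability."""
    return min(1.0, variance / (n * eps**2))

n, p, eps = 100, 0.5, 0.1
print(exact_two_sided_tail(n, p, eps), chebyshev_bound(n, p * (1 - p), eps))
```

In this example the Chebyshev bound is valid but several times larger than the exact probability, and the gap widens quickly as $n$ grows.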

To calibrate our expectations on what improvement to expect relative to 
Chebyshev’s inequality, let us start by recalling the central limit theorem 
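(CLT). Let $S_n = \sum_{t=1}^n (X_t - \mu)$. The CLT says that under no assumptions 
beyond the existence of the variance, the limiting distribution of 
$S_n/\sqrt{n\sigma^2}$ as $n \to \infty$ is a Gaussian with mean zero and unit variance. If 
$Z \sim \mathcal{N}(0,1)$, then 

$$\mathbb{P}(Z \ge u) = \int_u^\infty \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx.$$

The integral has no closed-form solution, but is easy to bound for $u > 0$: 

$$\int_u^\infty \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx \le \frac{1}{u\sqrt{2\pi}} \int_u^\infty x \exp\left(-\frac{x^2}{2}\right) dx = \sqrt{\frac{1}{2\pi u^2}} \exp\left(-\frac{u^2}{2}\right),$$

which gives 

$$\mathbb{P}(\hat\mu \ge \mu + \varepsilon) = \mathbb{P}\left(S_n/\sqrt{\sigma^2 n} \ge \varepsilon\sqrt{n/\sigma^2}\right) \approx \mathbb{P}\left(Z \ge \varepsilon\sqrt{n/\sigma^2}\right) \le \sqrt{\frac{\sigma^2}{2\pi n\varepsilon^2}} \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right). \qquad (5.4)$$

This always improves on what we obtained with Chebyshev’s inequality, usually 
by an enormous margin (Exercise 5.3). In particular, the bound on the right-hand 
side of (5.4) decays slightly faster than the negative exponential of $n\varepsilon^2/(2\sigma^2)$, 
which means that $\hat\mu$ rapidly concentrates around its mean. 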
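
The size of the improvement is easy to check numerically. The sketch below evaluates the right-hand sides of (5.4) and (5.2) for a few sample sizes; the function names and the parameter values are assumptions chosen for illustration only.

```python
from math import exp, pi, sqrt

def gaussian_tail_bound(n, sigma2, eps):
    """Right-hand side of (5.4)."""
    return sqrt(sigma2 / (2 * pi * n * eps**2)) * exp(-n * eps**2 / (2 * sigma2))

def chebyshev_rhs(n, sigma2, eps):
    """Right-hand side of (5.2)."""
    return sigma2 / (n * eps**2)

for n in (10, 100, 1000):
    print(n, gaussian_tail_bound(n, 1.0, 0.2), chebyshev_rhs(n, 1.0, 0.2))
```

Note that (5.2) bounds a two-sided tail while (5.4) bounds a one-sided tail, so the comparison is slightly generous to (5.4); doubling the Gaussian bound does not change the qualitative picture.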


An oft-taught rule of thumb is that the CLT provides a reasonable 
approximation for $n \ge 30$. We advise caution. Suppose that $X_1, \ldots, X_n$ 
are independent Bernoulli with bias $p = 1/n$. As $n$ tends to infinity the 
distribution of $\sum_{t=1}^n X_t$ converges to a Poisson distribution with parameter 
1, which does not look Gaussian at all. 
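
The Poisson limit in this remark can be seen directly by comparing probability mass functions. The following is a small sketch with assumed helper names `binomial_pmf` and `poisson_pmf` and an arbitrary choice of $n = 1000$.

```python
from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    """P(X_1 + ... + X_n = k) for independent Bernoulli(p) variables."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    """Probability mass function of a Poisson distribution with parameter lam."""
    return exp(-lam) * lam**k / factorial(k)

n = 1000
for k in range(5):
    print(k, binomial_pmf(n, 1 / n, k), poisson_pmf(1.0, k))
```

The two columns agree to about three decimal places, and the resulting distribution is concentrated on a handful of small integers, quite unlike a bell curve.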


The asymptotic nature of the CLT makes it unsuitable for designing bandit 
algorithms. In the next section, we derive finite-time analogs, which are only 
possible by making additional assumptions. 

The Cramér-Chernoff Method and Subgaussian Random Variables 


For the sake of moving rapidly towards bandits, we start with a straightforward 
and relatively fundamental assumption on the distribution of X, known as the 
subgaussian assumption. 


DEFINITION 5.2 (Subgaussianity). A random variable $X$ is $\sigma$-subgaussian if for 
all $\lambda \in \mathbb{R}$, it holds that $\mathbb{E}[\exp(\lambda X)] \le \exp(\lambda^2\sigma^2/2)$. 

An alternative way to express the subgaussianity condition uses the moment-generating 
function of $X$, which is a function $M_X : \mathbb{R} \to \mathbb{R}$ defined by 
$M_X(\lambda) = \mathbb{E}[\exp(\lambda X)]$. The condition in the definition can be written as 

$$\psi_X(\lambda) = \log M_X(\lambda) \le \frac{1}{2}\lambda^2\sigma^2 \quad \text{for all } \lambda \in \mathbb{R}.$$

The function $\psi_X$ is called the cumulant-generating function. It is not hard 
to see that $M_X$ (or $\psi_X$) need not exist for all random variables over the whole 
range of real numbers. For example, if $X$ is exponentially distributed and $\lambda \ge 1$, 
then 

$$\mathbb{E}[\exp(\lambda X)] = \int_0^\infty \underbrace{\exp(-x)}_{\text{density of exponential}} \times \exp(\lambda x)\, dx = \infty.$$

The moment-generating function of $X \sim \mathcal{N}(0, \sigma^2)$ satisfies $M_X(\lambda) = \exp(\lambda^2\sigma^2/2)$, 
and so $X$ is $\sigma$-subgaussian. 

A random variable $X$ is heavy tailed if $M_X(\lambda) = \infty$ for all $\lambda > 0$. Otherwise 
it is light tailed. 


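The following theorem explains the origin of the term ‘subgaussian’. The tails 
of a $\sigma$-subgaussian random variable decay approximately as fast as that of a 
Gaussian with zero mean and the same variance. 

THEOREM 5.3. If $X$ is $\sigma$-subgaussian, then for any $\varepsilon \ge 0$, 

$$\mathbb{P}(X \ge \varepsilon) \le \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right). \qquad (5.5)$$

Proof We take a generic approach called the Cramér-Chernoff method. Let 
$\lambda > 0$ be some constant to be tuned later. Then 

$$\begin{aligned}
\mathbb{P}(X \ge \varepsilon) &= \mathbb{P}(\exp(\lambda X) \ge \exp(\lambda\varepsilon)) \\
&\le \mathbb{E}[\exp(\lambda X)] \exp(-\lambda\varepsilon) && \text{(Markov’s inequality)} \\
&\le \exp\left(\frac{\lambda^2\sigma^2}{2} - \lambda\varepsilon\right). && \text{(Def. of subgaussianity)}
\end{aligned}$$

Choosing $\lambda = \varepsilon/\sigma^2$ completes the proof. 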
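
The choice of $\lambda$ in the last step is the minimiser of the exponent, which is a quadratic in $\lambda$. Spelled out as a worked calculation (nothing here goes beyond the proof itself):

```latex
\frac{d}{d\lambda}\left(\frac{\lambda^2\sigma^2}{2} - \lambda\varepsilon\right)
  = \lambda\sigma^2 - \varepsilon = 0
  \quad\Longrightarrow\quad
  \lambda = \frac{\varepsilon}{\sigma^2},
\qquad
\frac{\sigma^2}{2}\cdot\frac{\varepsilon^2}{\sigma^4} - \frac{\varepsilon}{\sigma^2}\cdot\varepsilon
  = -\frac{\varepsilon^2}{2\sigma^2}.
```

Substituting this value into the last line of the display in the proof gives exactly the right-hand side of (5.5).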
A similar inequality holds for the left tail. By using the union bound 
$\mathbb{P}(A \cup B) \le \mathbb{P}(A) + \mathbb{P}(B)$, we also find that $\mathbb{P}(|X| \ge \varepsilon) \le 2\exp(-\varepsilon^2/(2\sigma^2))$. 
An equivalent form of these bounds is 

$$\mathbb{P}\left(X \ge \sqrt{2\sigma^2\log(1/\delta)}\right) \le \delta \quad \text{and} \quad \mathbb{P}\left(|X| \ge \sqrt{2\sigma^2\log(2/\delta)}\right) \le \delta.$$

This form is often more convenient, especially the latter, which for small $\delta$ 
shows that with overwhelming probability $X$ takes values in the interval 

$$\left(-\sqrt{2\sigma^2\log(2/\delta)},\ \sqrt{2\sigma^2\log(2/\delta)}\right).$$

To study the tail behaviour of $\hat\mu - \mu$, we need one more lemma. 


LEMMA 5.4. Suppose that $X$ is $\sigma$-subgaussian and $X_1$ and $X_2$ are independent 
and $\sigma_1$- and $\sigma_2$-subgaussian, respectively, then: 

(a) $\mathbb{E}[X] = 0$ and $\mathbb{V}[X] \le \sigma^2$. 
(b) $cX$ is $|c|\sigma$-subgaussian for all $c \in \mathbb{R}$. 
(c) $X_1 + X_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-subgaussian. 

The proof of the lemma is left to the reader (Exercise 5.7). Combining 
Lemma 5.4 and Theorem 5.3 leads to a straightforward bound on the tails 
of $\hat\mu - \mu$. 

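COROLLARY 5.5. Assume that $X_i - \mu$ are independent, $\sigma$-subgaussian random 
variables. Then for any $\varepsilon \ge 0$, 

$$\mathbb{P}(\hat\mu \ge \mu + \varepsilon) \le \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right) \quad \text{and} \quad \mathbb{P}(\hat\mu \le \mu - \varepsilon) \le \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right),$$

where $\hat\mu = \frac{1}{n}\sum_{t=1}^n X_t$. 

Proof By Lemma 5.4, it holds that $\hat\mu - \mu = \sum_{i=1}^n (X_i - \mu)/n$ is $\sigma/\sqrt{n}$-subgaussian. 
Then apply Theorem 5.3. 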


For $x > 0$, it holds that $\exp(-x) \le 1/(ex)$, which shows that the above 
inequality is stronger than what we obtained via Chebyshev’s inequality except 
when $\varepsilon$ is very small. It is exponentially smaller if $n\varepsilon^2$ is large relative to $\sigma^2$. The 
deviation form of the above result says that under the conditions of the result, 
for any $\delta \in [0, 1]$, with probability at least $1 - \delta$, 

$$\mu \le \hat\mu + \sqrt{\frac{2\sigma^2\log(1/\delta)}{n}}. \qquad (5.6)$$

Symmetrically, it also follows that with probability at least $1 - \delta$, 

$$\mu \ge \hat\mu - \sqrt{\frac{2\sigma^2\log(1/\delta)}{n}}. \qquad (5.7)$$

Again, one can use a union bound to derive a two-sided inequality. 
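
The deviation form translates directly into a confidence interval for the mean. The following is a minimal sketch; the function names `confidence_radius` and `two_sided_interval` are hypothetical, and the two-sided version simply applies the union bound with $\delta/2$ allotted to each tail.

```python
from math import log, sqrt

def confidence_radius(n, sigma, delta):
    """Radius sqrt(2 sigma^2 log(1/delta) / n) appearing in (5.6) and (5.7)."""
    if not 0 < delta <= 1:
        raise ValueError("delta must lie in (0, 1] for the radius to be finite")
    return sqrt(2 * sigma**2 * log(1 / delta) / n)

def two_sided_interval(mu_hat, n, sigma, delta):
    """Interval that contains mu with probability at least 1 - delta.

    The union bound splits delta evenly between the upper and lower tails.
    """
    r = confidence_radius(n, sigma, delta / 2)
    return mu_hat - r, mu_hat + r

# Example: 100 samples of a 1/2-subgaussian variable with empirical mean 0.5.
print(two_sided_interval(0.5, n=100, sigma=0.5, delta=0.05))
```

Because the radius scales as $1/\sqrt{n}$, quadrupling the number of samples halves the width of the interval.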


EXAMPLE 5.6. The following random variables are subgaussian: 

(a) If $X$ is Gaussian with mean zero and variance $\sigma^2$, then $X$ is $\sigma$-subgaussian. 
(b) If $X$ has mean zero and $|X| \le B$ almost surely for $B \ge 0$, then $X$ is 
$B$-subgaussian. 
(c) If $X$ has mean zero and $X \in [a, b]$ almost surely, then $X$ is $(b - a)/2$-subgaussian. 

If $X$ is exponentially distributed with rate $\lambda > 0$, then $X$ is not $\sigma$-subgaussian 
for any $\sigma \in \mathbb{R}$. 

For random variables that are not centred ($\mathbb{E}[X] \ne 0$), we abuse notation 
by saying that $X$ is $\sigma$-subgaussian if the noise $X - \mathbb{E}[X]$ is $\sigma$-subgaussian. 
A distribution is called $\sigma$-subgaussian if a random variable drawn from that 
distribution is $\sigma$-subgaussian. Subgaussianity is really a property of both a 
random variable and the measure on the space on which it is defined, so the 
nomenclature is doubly abused. 


Notes 


1 The Berry-Esseen theorem (independently discovered by Berry [1941] and 
Esseen [1942]) quantifies the speed of convergence in the CLT. It essentially 
says that the distance between the Gaussian and the actual distribution decays 
at a rate of $1/\sqrt{n}$ under some mild assumptions (see Exercise 5.5). This is 
known to be tight for the class of probability distributions that appear in the 
Berry-Esseen result. However, this is a vacuous result when the tail probabilities 
themselves are much smaller than $1/\sqrt{n}$. Hence the need for concrete finite-time 
results. 


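2 Theorem 5.3 shows that subgaussian random variables have tails that decay 
almost as fast as a Gaussian. A version of the converse is also possible. That 
is, if a centered random variable has tails that behave in a similar way to a Gaussian, 
then it is subgaussian. In particular, the following holds: let $X$ be a centered 
random variable ($\mathbb{E}[X] = 0$) with $\mathbb{P}(|X| \ge \varepsilon) \le 2\exp(-\varepsilon^2/2)$. Then $X$ is 
$\sqrt{5}$-subgaussian: 

$$\begin{aligned}
\mathbb{E}[\exp(\lambda X)] &= \mathbb{E}\left[\sum_{i=0}^\infty \frac{\lambda^i X^i}{i!}\right] \le 1 + \sum_{i=2}^\infty \mathbb{E}\left[\frac{\lambda^i |X|^i}{i!}\right] \\
&\le 1 + \sum_{i=2}^\infty \int_0^\infty \mathbb{P}\left(|X| \ge \frac{(i!)^{1/i} x^{1/i}}{\lambda}\right) dx && \text{(Exercise 2.20)} \\
&\le 1 + 2\sum_{i=2}^\infty \int_0^\infty \exp\left(-\frac{(i!)^{2/i} x^{2/i}}{2\lambda^2}\right) dx && \text{(by assumption)} \\
&= 1 + \sqrt{2\pi}\,\lambda \left(\exp(\lambda^2/2)\left(1 + \operatorname{erf}\left(\frac{\lambda}{\sqrt{2}}\right)\right) - 1\right) && \text{(by Mathematica)} \\
&\le \exp\left(\frac{5\lambda^2}{2}\right).
\end{aligned}$$

This bound is surely loose. At the same time, there is little room for 
improvement: if $X$ has density $p(x) = |x| \exp(-x^2/2)/2$, then $\mathbb{P}(|X| \ge \varepsilon) = 
\exp(-\varepsilon^2/2)$. And yet $X$ is at best $\sqrt{2}$-subgaussian, so some degree of slack is 
required (see Exercise 5.4). 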


3 We saw in (5.4) that if $X_1, X_2, \ldots, X_n$ are independent Gaussian 
random variables with zero mean and variance $\sigma^2$ and $\hat\mu = \frac{1}{n}\sum_{t=1}^n X_t$, then 

$$\mathbb{P}(\hat\mu \ge \varepsilon) \le \sqrt{\frac{\sigma^2}{2\pi n\varepsilon^2}} \exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).$$

If $n\varepsilon^2/\sigma^2$ is relatively large, then this bound is marginally stronger than 
$\exp(-n\varepsilon^2/(2\sigma^2))$, which follows from the subgaussian analysis. One might ask 
whether or not a similar improvement is possible more generally. And Talagrand 
[1995] will tell you: yes! At least for bounded random variables (details in the 
paper). 

4 Hoeffding’s lemma states that if $X$ is a zero-mean random variable such that 
$X \in [a, b]$ almost surely for real values $a < b$, then $M_X(\lambda) \le \exp(\lambda^2(b - a)^2/8)$. 
Applying the Cramér-Chernoff method shows that if $X_1, X_2, \ldots, X_n$ are 
independent and $X_t \in [a_t, b_t]$ almost surely with $a_t < b_t$ for all $t$, then 

$$\mathbb{P}\left(\frac{1}{n}\sum_{t=1}^n (X_t - \mathbb{E}[X_t]) \ge \varepsilon\right) \le \exp\left(\frac{-2n^2\varepsilon^2}{\sum_{t=1}^n (b_t - a_t)^2}\right). \qquad (5.8)$$

The above is called Hoeffding’s inequality. For details see Exercise 5.11. 
There are many variants of this result that provide tighter bounds when $X$ 
satisfies certain additional distributional properties like small variance (see 
Exercise 5.14). 
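
As a quick illustration, the right-hand side of Hoeffding’s inequality is simple to evaluate for given ranges. The helper name `hoeffding_bound` and the example values below are assumptions for the sake of the sketch.

```python
from math import exp

def hoeffding_bound(ranges, eps):
    """Right-hand side of (5.8) for independent X_t with X_t in [a_t, b_t].

    `ranges` is a list of (a_t, b_t) pairs, one per sample.
    """
    n = len(ranges)
    total = sum((b - a) ** 2 for a, b in ranges)
    return exp(-2 * n**2 * eps**2 / total)

# 100 samples in [0, 1] and a deviation of 0.1 give exp(-2).
print(hoeffding_bound([(0.0, 1.0)] * 100, 0.1))
```

When all ranges are equal to $[0, 1]$ the bound reduces to $\exp(-2n\varepsilon^2)$, which matches Corollary 5.5 with $\sigma = 1/2$.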

5 The Cramér-Chernoff method is applicable beyond the subgaussian case, even 
when the moment-generating function is not defined globally. One example 
where this occurs is when $X_1, X_2, \ldots, X_n$ are independent standard Gaussian 
and $Y = \sum_{i=1}^n X_i^2$. Then $Y$ has a $\chi^2$-distribution with $n$ degrees of freedom. 
An easy calculation shows that $M_Y(\lambda) = (1 - 2\lambda)^{-n/2}$ for $\lambda \in [0, 1/2)$ and 
$M_Y(\lambda)$ is undefined for $\lambda \ge 1/2$. By the Cramér-Chernoff method, we have 

$$\mathbb{P}(Y \ge n + \varepsilon) \le \inf_{\lambda \in [0, 1/2)} M_Y(\lambda) \exp(-\lambda(n + \varepsilon)) \le \inf_{\lambda \in [0, 1/2)} \left(\frac{1}{1 - 2\lambda}\right)^{n/2} \exp(-\lambda(n + \varepsilon)).$$

Choosing $\lambda = \frac{\varepsilon}{2(n + \varepsilon)}$ (which minimizes the right-hand side) leads to 
$\mathbb{P}(Y \ge n + \varepsilon) \le \left(1 + \frac{\varepsilon}{n}\right)^{n/2} \exp\left(-\frac{\varepsilon}{2}\right)$, which turns out to be about the best you 
can do [Laurent and Massart, 2000]. 

6 The subgaussian concept provides a large class of distributions for which 
concentration is easily analysed. As mentioned, however, many distributions 
are not subgaussian, like the exponential and $\chi^2$-distribution. There are other 
general notions based on bounds on the moment generating function that 
generalise these kinds of distributions. For more on these ideas, you should 
look for keywords subexponential and subgamma. 
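
The Cramér-Chernoff bound for the $\chi^2$-distribution discussed in the notes above can be compared with the exact tail. The sketch below relies on the closed-form survival function that is available when the number of degrees of freedom is even; the function names are illustrative assumptions.

```python
from math import exp, factorial

def chi2_tail_exact(n, x):
    """P(Y >= x) for Y chi-squared with an even number n of degrees of freedom."""
    if n % 2 != 0:
        raise ValueError("the closed form used here requires even n")
    half = x / 2
    return exp(-half) * sum(half**j / factorial(j) for j in range(n // 2))

def chi2_chernoff_bound(n, eps):
    """Bound (1 + eps/n)^(n/2) * exp(-eps/2) on P(Y >= n + eps)."""
    return (1 + eps / n) ** (n / 2) * exp(-eps / 2)

n, eps = 10, 10.0
print(chi2_tail_exact(n, n + eps), chi2_chernoff_bound(n, eps))
```

For these values the bound is valid but conservative by roughly a factor of seven, which is typical of exponential tail bounds at moderate deviations.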


Bibliographical Remarks 


We return to concentration of measure many times, but note here that it is an 
interesting (and still active) topic of research. What we have seen is only the tip 
of the iceberg. Readers who want to learn more about this exciting field might 
enjoy the book by Boucheron et al. [2013]. For matrix versions of many standard 
results, there is a recent book by Tropp [2015]. The survey of McDiarmid [1998] 
has many of the classic results. There is a useful type of concentration bound 
that is ‘self-normalised’ by the variance. A nice book on this is by de la Peña 
et al. [2008]. Another tool that is occasionally useful for deriving concentration 
bounds in more unusual set-ups is called empirical process theory. There are 
several references for this, including those by van de Geer [2000] or Dudley [2014]. 


Exercises 


There are too many candidate exercises to list. We heartily recommend all the 
exercises in chapter 2 of the book by Boucheron et al. [2013]. 


5.1 (VARIANCE OF AVERAGE) Let $X_1, X_2, \ldots, X_n$ be a sequence of independent 
and identically distributed random variables with mean $\mu$ and variance $\sigma^2 < \infty$. 
Let $\hat\mu = \frac{1}{n}\sum_{t=1}^n X_t$ and show that $\mathbb{V}[\hat\mu] = \mathbb{E}[(\hat\mu - \mu)^2] = \sigma^2/n$. 


5.2 (MARKOV’S INEQUALITY) Prove Markov’s inequality (Lemma 5.1). 


5.3 Prove that the Gaussian tail probability bound on the right-hand side 
of Eq. (5.4) is smaller than the bound obtained with Chebyshev’s inequality 
Eq. (5.2). In what regime is the improvement most dramatic and in what regime 
are both bounds trivial? 


5.4 Let $X$ be a random variable on $\mathbb{R}$ with density with respect to the Lebesgue 
measure of $p(x) = |x| \exp(-x^2/2)/2$. Show the following: 

(a) $\mathbb{P}(|X| \ge \varepsilon) = \exp(-\varepsilon^2/2)$. 
(b) $X$ is not $\sqrt{2 - \varepsilon}$-subgaussian for any $\varepsilon > 0$. 

5.5 (BERRY-ESSEEN INEQUALITY) Let $X_1, X_2, \ldots, X_n$ be a sequence of 
independent and identically distributed random variables with mean $\mu$, variance 
$\sigma^2$ and bounded third absolute moment: 

$$\rho = \mathbb{E}[|X_1 - \mu|^3] < \infty.$$

Let $S_n = \sum_{t=1}^n (X_t - \mu)/\sigma$. The Berry-Esseen theorem shows that 

$$\sup_x \left| \mathbb{P}\left(\frac{S_n}{\sqrt{n}} \le x\right) - \underbrace{\frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp(-y^2/2)\, dy}_{\Phi(x)} \right| \le \frac{C\rho}{\sigma^3\sqrt{n}},$$

where $C < 1/2$ is a universal constant. 

(a) Let $\hat\mu_n = \frac{1}{n}\sum_{t=1}^n X_t$ and derive a tail bound from the Berry-Esseen theorem. 
That is, give a bound of the form $\mathbb{P}(\hat\mu_n \ge \mu + \varepsilon)$ for positive values of $\varepsilon$. 

(b) Compare your bound with the one that can be obtained from the Cramér-Chernoff 
method. Argue for and against the superiority of one over the 
other. 


5.6 (CENTRAL LIMIT THEOREM) We mentioned that invoking the CLT to 
approximate the distribution of sums of independent Bernoulli random variables 
using a Gaussian can be a bad idea. Let $X_1, \ldots, X_n \sim \mathcal{B}(p)$ be independent 
Bernoulli random variables with common mean $p = p_n = \lambda/n$, where $\lambda \in (0, 1)$. 
For $x \in \mathbb{N}$, let $P_n(x) = \mathbb{P}(X_1 + \cdots + X_n = x)$. 

(a) Show that $\lim_{n\to\infty} P_n(x) = e^{-\lambda}\lambda^x/(x!)$, which is a Poisson distribution with 
parameter $\lambda$. 

(b) Explain why this does not contradict the CLT, and discuss the implications 
of the Berry-Esseen theorem. 

(c) In what way does this show that the CLT is indeed a poor approximation in 
some cases? 

(d) Based on Monte Carlo simulations, plot the distribution of $X_1 + \cdots + X_n$ 
for $n = 30$ and some well-chosen values of $\lambda$. Compare the distribution to 
what you would get from the CLT. What can you conclude? 


5.7 (PROPERTIES OF SUBGAUSSIAN RANDOM VARIABLES (I)) Prove Lemma 5.4. 

HINT Use Taylor series. 

5.8 (PROPERTIES OF SUBGAUSSIAN RANDOM VARIABLES (II)) Let $X_i$ be $\sigma_i$-subgaussian 
for $i \in \{1, 2\}$ with $\sigma_i \ge 0$. Prove that $X_1 + X_2$ is $(\sigma_1 + \sigma_2)$-subgaussian. 
Do not assume independence of $X_1$ and $X_2$. 


5.9 (PROPERTIES OF MOMENT/CUMULANT-GENERATING FUNCTIONS) Let $X$ 
be a real-valued random variable and let $M_X(\lambda) = \mathbb{E}[\exp(\lambda X)]$ be its moment-generating 
function defined over $\operatorname{dom}(M_X) \subset \mathbb{R}$, where the expectation takes on 
finite values. Show that the following properties hold: 

(a) $M_X$ is convex, and in particular $\operatorname{dom}(M_X)$ is an interval containing zero. 
(b) $M_X(\lambda) \ge e^{\lambda \mathbb{E}[X]}$ for all $\lambda \in \operatorname{dom}(M_X)$. 
(c) For any $\lambda$ in the interior of $\operatorname{dom}(M_X)$, $M_X$ is infinitely many times 
differentiable. 
(d) Let $M_X^{(k)}(\lambda) = \frac{d^k}{d\lambda^k} M_X(\lambda)$. Then, for $\lambda$ in the interior of $\operatorname{dom}(M_X)$, 
$M_X^{(k)}(\lambda) = \mathbb{E}[X^k \exp(\lambda X)]$. 
(e) Assuming 0 is in the interior of $\operatorname{dom}(M_X)$, $M_X^{(k)}(0) = \mathbb{E}[X^k]$ (hence the 
name of $M_X$). 
(f) $\psi_X$ is convex (that is, $M_X$ is log-convex). 

HINT For part (a), use the convexity of $x \mapsto e^x$. 


5.10 (LARGE DEVIATION THEORY) Let $X, X_1, X_2, \ldots, X_n$ be a sequence of 
independent and identically distributed random variables with zero mean and 
moment-generating function $M_X$ with $\operatorname{dom}(M_X) = \mathbb{R}$. Let $\hat\mu_n = \frac{1}{n}\sum_{t=1}^n X_t$. 

(a) Show that for any $\varepsilon > 0$, 

$$\frac{1}{n} \log \mathbb{P}(\hat\mu_n \ge \varepsilon) \le -\psi_X^*(\varepsilon) = -\sup_\lambda \left(\lambda\varepsilon - \log M_X(\lambda)\right). \qquad (5.9)$$

(b) Show that when $X$ is a Rademacher variable ($\mathbb{P}(X = -1) = \mathbb{P}(X = 1) = 1/2$), 
$\psi_X^*(\varepsilon) = \frac{1+\varepsilon}{2}\log(1 + \varepsilon) + \frac{1-\varepsilon}{2}\log(1 - \varepsilon)$ when $|\varepsilon| < 1$ and $\psi_X^*(\varepsilon) = +\infty$ 
otherwise. 

(c) Show that when $X$ is a centered Bernoulli random variable with parameter 
$p$ (that is, $\mathbb{P}(X = -p) = 1 - p$ and $\mathbb{P}(X = 1 - p) = p$) then $\psi_X^*(\varepsilon) = \infty$ 
when $\varepsilon$ is such that $p + \varepsilon > 1$ and $\psi_X^*(\varepsilon) = d(p + \varepsilon, p)$ otherwise, where 
$d(p, q) = p\log(p/q) + (1 - p)\log((1 - p)/(1 - q))$ is the relative entropy 
between the distributions $\mathcal{B}(p)$ and $\mathcal{B}(q)$. 

(d) Show that when $X \sim \mathcal{N}(0, \sigma^2)$ then $\psi_X^*(\varepsilon) = \varepsilon^2/(2\sigma^2)$. 

(e) Let $\sigma^2 = \mathbb{V}[X]$. The (strong form of the) central limit theorem says that 

$$\lim_{n\to\infty} \sup_{x \in \mathbb{R}} \left| \mathbb{P}\left(\hat\mu_n \sqrt{\frac{n}{\sigma^2}} \ge x\right) - (1 - \Phi(x)) \right| = 0,$$

where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp(-y^2/2)\, dy$ is the cumulative distribution of the 
standard Gaussian. Let $Z$ be a random variable distributed like a standard 
Gaussian. A careless application of this result might suggest that 

$$\lim_{n\to\infty} \frac{1}{n} \log \mathbb{P}(\hat\mu_n \ge \varepsilon) \stackrel{?}{=} \lim_{n\to\infty} \frac{1}{n} \log \mathbb{P}\left(Z \ge \varepsilon\sqrt{\frac{n}{\sigma^2}}\right).$$

Evaluate the right-hand side. In light of the previous parts, what can you 
conclude about the validity of the question-marked equality? What goes 
wrong with the careless application of the central limit theorem? What do 
you conclude about the accuracy of this theorem? 

HINT For part (e), consider using Eq. (13.4). 


As it happens, the inequality in (5.9) may be replaced by an equality 
as $n \to \infty$. The assumption that the moment-generating function exists 
everywhere may be relaxed significantly. We refer the interested reader to 
the classic text by Dembo and Zeitouni [2009]. The function $\psi_X^*$ is called the 
Legendre transform, convex conjugate or Fenchel dual of the convex 
function $\psi_X$. In probability theory, $\psi_X^*$ is also called the Cramér transform 
and is also known as a rate function. Convexity and the Fenchel dual will 
play a role in some of the later chapters and will be discussed in more detail 
in Chapter 26 and later. 


The name “large deviation” originates from rewriting the tail probabilities in 
terms of the partial sum $S_n = X_1 + \cdots + X_n$. The inequality in (5.9) 
bounds the probability of the deviation of $S_n$ from its mean (which is zero by 
assumption) at a scale of $O(n)$: $\mathbb{P}(\hat\mu_n \ge \varepsilon) = \mathbb{P}(S_n \ge n\varepsilon)$. In contrast, the 
central limit theorem (CLT) gives the (limiting) probability of the deviation 
of $S_n$ from its mean at the scale of $O(\sqrt{n})$: $\mathbb{P}(\hat\mu_n\sqrt{n} \ge \varepsilon) = \mathbb{P}(S_n \ge \sqrt{n}\varepsilon)$. 
Compared to $\sqrt{n}\varepsilon$, $n\varepsilon$ is thought of as a “large” deviation. The deviation 
probabilities at this scale can decay to zero faster than what the CLT 
predicts, as also showcased in the last part of the last exercise. But what 
happens at intermediate scales? That is, when deviations are of size $n^\alpha\varepsilon$ with 
$1/2 < \alpha < 1$? This is studied under the name of moderate deviations. 
As it turns out, in this case, the ruthless use of the large deviation formula 
gives correct answers. The reader who wants to learn more about large 
deviation theory can check out the lecture notes by Swart [2017]. 


5.11 (HOEFFDING’S LEMMA) Suppose that $X$ is zero mean and $X \in [a, b]$ almost 
surely for constants $a < b$. 

(a) Show that $X$ is $(b - a)/2$-subgaussian. 
(b) Prove Hoeffding’s inequality (5.8). 

HINT For part (a), it suffices to prove that $\psi_X(\lambda) \le \lambda^2(b - a)^2/8$. By Taylor’s 
theorem, for some $\lambda'$ between 0 and $\lambda$, $\psi_X(\lambda) = \psi_X(0) + \psi_X'(0)\lambda + \psi_X''(\lambda')\lambda^2/2$. 
To bound the last term, introduce the distribution $P_\lambda$ for $\lambda \in \mathbb{R}$ arbitrary: 
$P_\lambda(dz) = e^{-\psi_X(\lambda)} e^{\lambda z} P(dz)$. Show that $\psi_X''(\lambda) = \mathbb{V}[Z]$, where $Z \sim P_\lambda$. Now, 
since $Z \in [a, b]$ with probability one, argue (without relying on $\mathbb{E}[Z]$) that 
$\mathbb{V}[Z] \le (b - a)^2/4$. 


5.12 (SUBGAUSSIANITY OF BERNOULLI DISTRIBUTION) Let $X$ be a random
variable with Bernoulli distribution with mean $p$. That is, $X \sim B(p)$: $P(X = 1) = p$
and $P(X = 0) = 1 - p$.

(a) Show that $X$ is $1/2$-subgaussian for all $p$.

(b) Let $Q : [0,1] \to [0,1/2]$ be the function given by $Q(p) = \sqrt{\frac{1-2p}{2\ln((1-p)/p)}}$,
where undefined points are defined in terms of their limits. Show that $X$ is
$Q(p)$-subgaussian.

(c) The subgaussianity constant of a random variable $X$ is the smallest value of
$\sigma$ such that $X$ is $\sigma$-subgaussian. Show that the subgaussianity constant of
$X \sim B(p)$ is $Q(p)$.

(d) Plot $Q(p)$ as a function of $p$. How does it compare to $\sqrt{V[X]} = \sqrt{p(1-p)}$?

(e) Show that for $\lambda > 0$ and $p > 1/2$, $E[\exp(\lambda X)] \le \exp(p(1-p)\lambda^2/2)$. Think of
how these inequalities are used for bounding tails. What do you conclude?


Readers looking for a hint to parts (b), (c) and (e) in the previous exercise
might like to look at the papers by Berend and Kontorovich [2013] and
Ostrovsky and Sirota [2014]. The result that the subgaussianity constant of
$X \sim B(p)$ is upper bounded by $Q(p)$ is known as the Kearns-Saul inequality
and is due to Kearns and Saul [1998].


5.13 (CENTRAL LIMIT THEOREM FOR SUMS OF BERNOULLI RANDOM VARIABLES)
In this question we try to understand the concentration of the empirical mean
for Bernoulli random variables. Let $X_1, X_2, \ldots, X_n$ be independent Bernoulli
random variables with mean $p \in [0,1]$ and $\hat p_n = \sum_{t=1}^n X_t/n$. Let $Z_n$ be a normally
distributed random variable with mean $p$ and variance $p(1-p)/n$.

(a) Write down expressions for $E[\hat p_n]$ and $V[\hat p_n]$.

(b) What does the central limit theorem say about the relationship between $\hat p_n$
and $Z_n$ as $n$ gets large?

(c) For each $p \in \{1/10, 1/2\}$ and $\delta = 1/100$ and $\Delta = 1/10$, find the minimum $n$
such that $P(\hat p_n \ge p + \Delta) \le \delta$.

(d) Let $p = 1/10$ and $\Delta = 1/10$ and

$$n_{\mathrm{Ber}}(\delta, p, \Delta) = \min\{n : P(\hat p_n \ge p + \Delta) \le \delta\},$$
$$n_{\mathrm{Gauss}}(\delta, p, \Delta) = \min\{n : P(Z_n \ge p + \Delta) \le \delta\}.$$

(i) Evaluate analytically the value of

$$\lim_{\delta \to 0} \frac{n_{\mathrm{Ber}}(\delta, 1/10, 1/10)}{n_{\mathrm{Gauss}}(\delta, 1/10, 1/10)}.$$

(ii) In light of the central limit theorem, explain why the answer you got in (i)
was not 1.

Hint For Part (d.i) use large deviation theory (Exercise 5.10).


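5.14 (BERNSTEIN’S INEQUALITY) Let $X_1, \ldots, X_n$ be a sequence of independent
random variables with $X_t - E[X_t] \le b$ almost surely and $S = \sum_{t=1}^n (X_t - E[X_t])$
and $v = \sum_{t=1}^n V[X_t]$.

(a) Show that $g(x) = \frac{1}{2!} + \frac{x}{3!} + \frac{x^2}{4!} + \cdots = (\exp(x) - 1 - x)/x^2$ is increasing.

(b) Let $X$ be a random variable with $E[X] = 0$ and $X \le b$ almost surely. Show
that $E[\exp(X)] \le 1 + g(b)V[X]$.

(c) Prove that $(1+\alpha)\log(1+\alpha) - \alpha \ge \frac{3\alpha^2}{6+2\alpha}$ for all $\alpha \ge 0$. Prove that this is
the best possible approximation in the sense that the 2 in the denominator
cannot be increased.

(d) Let $\varepsilon > 0$ and $\alpha = b\varepsilon/v$ and prove that

$$P(S \ge \varepsilon) \le \exp\left(-\frac{v}{b^2}\left((1+\alpha)\log(1+\alpha) - \alpha\right)\right) \tag{5.10}$$
$$\le \exp\left(-\frac{\varepsilon^2}{2v\left(1 + \frac{b\varepsilon}{3v}\right)}\right). \tag{5.11}$$

(e) Use the previous result to show that

$$P\left(S \ge \sqrt{2v\log\left(\frac{1}{\delta}\right)} + \frac{2b}{3}\log\left(\frac{1}{\delta}\right)\right) \le \delta.$$

(f) Let $X_1, X_2, \ldots, X_n$ be a sequence of random variables adapted to filtration
$\mathbb{F} = (\mathcal{F}_t)_t$. Abbreviate $E_t[\cdot] = E[\cdot \mid \mathcal{F}_t]$ and $\mu_t = E_{t-1}[X_t]$. Define $S =
\sum_{t=1}^n (X_t - \mu_t)$ and let $V = \sum_{t=1}^n E_{t-1}[(X_t - \mu_t)^2]$ be the predictable variation
of $(\sum_{s=1}^t (X_s - \mu_s))_t$. Show that if $X_t - \mu_t \le b$ holds almost surely for all
$t \in [n]$, then with $\alpha = b\varepsilon/v$,

$$P(S \ge \varepsilon, V \le v) \le \exp\left(-\frac{v}{b^2}\left((1+\alpha)\log(1+\alpha) - \alpha\right)\right).$$

Note that the right-hand side of this inequality is the same as that shown in
Eq. (5.10).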
The bound in Eq. (5.10) is called Bennett’s inequality and the one
in Eq. (5.11) is called Bernstein’s inequality. There are several
generalisations, the most notable of which is the martingale version that
slightly relaxes the independence assumption and which was presented in
Part (f). Martingale techniques appear in Chapter 20. Another useful variant
(under slightly different conditions) replaces the actual variance with the
empirical variance. This is useful when the variance is unknown. For more,
see the papers by Audibert et al. [2007], Mnih et al. [2008], Maurer and
Pontil [2009].
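
As a rough numerical illustration of how the two bounds relate, the following Python sketch evaluates the right-hand sides of Eq. (5.10) and Eq. (5.11). The function names `bennett_bound` and `bernstein_bound` are hypothetical choices made here, and the inputs are assumed to satisfy $\varepsilon > 0$, $v > 0$ and $b > 0$. It is a calculator for the stated formulas, not a proof of either inequality.

```python
import math


def bennett_bound(eps: float, v: float, b: float) -> float:
    """Right-hand side of Eq. (5.10) with alpha = b * eps / v."""
    alpha = b * eps / v
    return math.exp(-(v / b**2) * ((1 + alpha) * math.log1p(alpha) - alpha))


def bernstein_bound(eps: float, v: float, b: float) -> float:
    """Right-hand side of Eq. (5.11)."""
    return math.exp(-eps**2 / (2 * v * (1 + b * eps / (3 * v))))


if __name__ == "__main__":
    # Illustrative values only: variance sum v = 2, almost-sure bound b = 1.5.
    for eps in (0.1, 1.0, 5.0):
        print(eps, bennett_bound(eps, 2.0, 1.5), bernstein_bound(eps, 2.0, 1.5))
```

By Part (c), the Bennett value should never exceed the Bernstein value, and the gap widens as $b\varepsilon/v$ grows.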


5.15 (ANOTHER BERNSTEIN-TYPE INEQUALITY) Let $X_1, X_2, \ldots, X_n$ be a
sequence of random variables adapted to the filtration $\mathbb{F} = (\mathcal{F}_t)_t$. Abbreviate
$E_t[\cdot] = E[\cdot \mid \mathcal{F}_t]$ and $\mu_t = E_{t-1}[X_t]$. Prove the following:

(a) If $\eta > 0$ and $\eta(X_t - \mu_t) \le 1$ almost surely, then

$$P\left(\sum_{t=1}^n (X_t - \mu_t) \ge \eta \sum_{t=1}^n E_{t-1}[(X_t - \mu_t)^2] + \frac{1}{\eta}\log\left(\frac{1}{\delta}\right)\right) \le \delta.$$

(b) If $\eta > 0$ and $\eta X_t \le 1$ almost surely, then

$$P\left(\sum_{t=1}^n (X_t - \mu_t) \ge \eta \sum_{t=1}^n E_{t-1}[X_t^2] + \frac{1}{\eta}\log\left(\frac{1}{\delta}\right)\right) \le \delta.$$

Hint Use the Cramér-Chernoff method and the fact that $\exp(x) \le 1 + x + x^2$
for all $x \le 1$ and $\exp(x) \ge 1 + x$ for all $x$.


Let $(M_t)$ be the martingale defined by $M_t = \sum_{s=1}^t (X_s - \mu_s)$. The inequalities
in Exercise 5.15 can be viewed as a kind of Bernstein’s inequality because
they bound the tail of the martingale $(M_t)$ in terms of the predictable
variation of the martingale $(M_t)$, which is $V = \sum_{t=1}^n E_{t-1}[(X_t - \mu_t)^2]$.
The main difference relative to well-known results is that the analysis has
stopped early. The next step is usually to choose $\eta$ to minimise the bound
in some sense, either by assuming bounds on the predictable variation,
union bounding or using the method of mixtures [de la Peña et al., 2008].
These techniques are covered in Chapter 20. Note that optimising $\eta$ directly is
not possible because the bounds hold for any fixed $\eta$, but minimising the
right-hand side inside the probability with respect to $\eta$ would lead to a
random $\eta$. For more martingale results with this flavour, see the notes by
McDiarmid [1998].


5.16 Let $X_1, \ldots, X_n$ be independent random variables with $P(X_t \le x) \le x$ for
each $x \in [0,1]$ and $t \in [n]$. Prove for any $\varepsilon > 0$ that

$$P\left(\sum_{t=1}^n \log(1/X_t) \ge \varepsilon\right) \le \left(\frac{\varepsilon}{n}\right)^n \exp(n - \varepsilon).$$


5.17 (SAMPLE MEAN CONCENTRATION FOR CATEGORICAL DISTRIBUTIONS) Let
$X_1, \ldots, X_n$ be an independent and identically distributed sequence taking values
in $[m]$. For $i \in [m]$, let $p(i) = P(X_1 = i)$ and $\hat p(i) = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{X_t = i\}$. Show
that for any $\delta \in (0,1)$,

$$P\left(\|p - \hat p\|_1 \ge \sqrt{\frac{2\left(\log\left(\frac{1}{\delta}\right) + m\log(2)\right)}{n}}\right) \le \delta. \tag{5.12}$$

HINT Use the fact that $\|p - \hat p\|_1 = \max_{\lambda \in \{-1,1\}^m} \langle \lambda, p - \hat p \rangle$.

The distribution of $n\hat p$ is known as the multinomial distribution. The
inequality (5.12) appears in the appendix of the book by van der Vaart and
Wellner [1996b], where it is called the Bretagnolle-Huber-Carol inequality.
The appendix contains two other refinements of this inequality.


5.18 (EXPECTATION OF MAXIMUM) Let $X_1, \ldots, X_n$ be a sequence of
$\sigma$-subgaussian random variables (possibly dependent) and $Z = \max_{t \in [n]} X_t$. Prove
that

(a) $E[Z] \le \sqrt{2\sigma^2 \log(n)}$.

(b) $P\left(Z \ge \sqrt{2\sigma^2 \log(n/\delta)}\right) \le \delta$ for any $\delta \in (0,1)$.

HINT Use Jensen’s inequality to show that $\exp(\lambda E[Z]) \le E[\exp(\lambda Z)]$, and then
provide a naive bound on the moment-generating function of $Z$.


5.19 (ALMOST SURELY BOUNDED SUMS) Let $X_1, X_2, \ldots, X_n$ be a sequence of
non-negative random variables adapted to filtration $(\mathcal{F}_t)_{t=0}^n$ such that $\sum_{t=1}^n X_t \le 1$
almost surely. Prove that for all $x > 1$,

$$P\left(\sum_{t=1}^n E[X_t \mid \mathcal{F}_{t-1}] \ge x\right) \le f_n(x) = \begin{cases} \left(\dfrac{n-x}{n-1}\right)^{n-1}, & \text{if } x < n; \\ 0, & \text{if } x \ge n, \end{cases}$$

where the equality serves as the definition of $f_n(x)$.

Hint This problem does not use the techniques introduced in the chapter.
Prove that Bernoulli random variables are the worst case and use backwards
induction. Although this result is new to our knowledge, a weaker version was
derived by Kirschner and Krause [2018] for the analysis of information-directed
sampling. The bound is tight in the sense that there exists a sequence of random
variables and filtration for which equality holds.


Part II


Stochastic Bandits with Finitely Many Arms

Over the next few chapters, we introduce the fundamental algorithms and
tools of analysis for unstructured stochastic bandits with finitely many actions. 
The keywords here are finite, unstructured and stochastic. The first of these 
just means that the number of actions available is finite. The second is more 
ambiguous, but roughly means that choosing one action yields no information 
about the mean pay-off of the other arms. A bandit is stochastic if the sequence 
of rewards associated with each action is independent and identically distributed 
according to some distribution. This latter assumption will be relaxed in Part III. 

There are several reasons to study this class of bandit problems. First, their 
simplicity makes them relatively easy to analyse and permits a deep understanding 
of the trade-off between exploration and exploitation. Second, many of the 
algorithms designed for finite-armed bandits, and the principles underlying them,
can be generalised to other settings. Finally, finite-armed bandits already have
applications, notably as a replacement for A/B testing, as discussed in the
introduction.


6 The Explore-Then-Commit Algorithm


The first bandit algorithm of the book is called explore-then-commit (ETC),
which explores by playing each arm a fixed number of times and then exploits by
committing to the arm that appeared best during exploration.


For this chapter, as well as Chapters 7 to 9, we assume that all bandit
instances are in $\mathcal{E}^k_{\mathrm{SG}}(1)$, which means the reward distribution for all arms is
1-subgaussian.


The focus on subgaussian distributions is mainly for simplicity. Many of the
techniques in the chapters that follow can be applied to other stochastic bandits
such as those listed in Table 4.1. The key difference is that new concentration
analysis is required that exploits the different assumptions. The Bernoulli case is
covered in Chapter 10, where other situations are discussed along with references
to the literature. Notice that the subgaussian assumption restricts the subgaussian
constant to $\sigma = 1$, which saves us from endlessly writing $\sigma$. All results hold for
other subgaussian constants by scaling the rewards (see Lemma 5.4). Two points
are obscured by this simplification:


(a) All the algorithms that follow rely on the knowledge of $\sigma$.


(b) It may happen that $P_i$ is subgaussian for all arms, but with a different
subgaussian constant for each arm. Algorithms are easily adapted to this
situation if the subgaussian constants are known, as you will investigate
in Exercise 7.2. The situation is more complicated when the subgaussian
constant is unknown (Exercise 7.7).


6.1 Algorithm and Regret Analysis


ETC is characterised by the number of times it explores each arm, denoted by a
natural number $m$. Because there are $k$ actions, the algorithm will explore for $mk$
rounds before choosing a single action for the remaining rounds. Let $\hat\mu_i(t)$ be the
average reward received from arm $i$ after round $t$, which is written formally as

$$\hat\mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^t \mathbb{I}\{A_s = i\} X_s,$$

where $T_i(t) = \sum_{s=1}^t \mathbb{I}\{A_s = i\}$ is the number of times action $i$ has been played
after round $t$. The ETC policy is given in Algorithm 1 below.


1: Input $m$.
2: In round $t$ choose action

$$A_t = \begin{cases} (t \bmod k) + 1, & \text{if } t \le mk; \\ \operatorname{argmax}_i \hat\mu_i(mk), & t > mk. \end{cases}$$

(ties in the argmax are broken arbitrarily)


Algorithm 1: Explore-then-commit.
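
Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `explore_then_commit` and the `pull` callback (which returns a reward for a given arm) are assumptions introduced here, arms and rounds are 0-indexed, and ties in the argmax are broken in favour of the lowest index, which is one of the arbitrary choices the algorithm permits.

```python
import random
from typing import Callable, List


def explore_then_commit(n: int, k: int, m: int,
                        pull: Callable[[int], float]) -> List[int]:
    """Run ETC for n rounds on k arms and return the sequence of arms played."""
    assert m >= 1 and m * k <= n, "requires 1 <= m <= n / k"
    sums = [0.0] * k          # total reward collected from each arm while exploring
    actions: List[int] = []
    committed = None
    for t in range(n):
        if t < m * k:
            arm = t % k       # round-robin exploration: each arm is played m times
            sums[arm] += pull(arm)
        else:
            if committed is None:
                # Every arm has exactly m samples, so the largest sum
                # identifies the largest empirical mean.
                committed = max(range(k), key=lambda i: sums[i])
            arm = committed
            pull(arm)         # reward is collected but no longer used
        actions.append(arm)
    return actions


if __name__ == "__main__":
    # Hypothetical two-armed Gaussian bandit with unit variance.
    means = [0.0, -0.5]
    rng = random.Random(1)
    played = explore_then_commit(1000, 2, 50, lambda i: rng.gauss(means[i], 1.0))
    print("committed arm:", played[-1])
```

Keeping running sums rather than means is enough here because all arms have the same number of samples at the commitment time.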


Recall that $\mu_i$ is the mean reward when playing action $i$ and $\Delta_i = \mu^* - \mu_i$ is
the suboptimality gap between the mean of action $i$ and the optimal action.


THEOREM 6.1. When ETC is interacting with any 1-subgaussian bandit and
$1 \le m \le n/k$,

$$R_n \le m \sum_{i=1}^k \Delta_i + (n - mk) \sum_{i=1}^k \Delta_i \exp\left(-\frac{m\Delta_i^2}{4}\right).$$

Proof Assume without loss of generality that the first arm is optimal, which
means that $\mu_1 = \mu^* = \max_i \mu_i$. By the decomposition given in Lemma 4.5, the
regret can be written as

$$R_n = \sum_{i=1}^k \Delta_i E[T_i(n)]. \tag{6.1}$$

In the first $mk$ rounds, the policy is deterministic, choosing each action exactly
$m$ times. Subsequently it chooses a single action maximising the average reward
during exploration. Thus,

$$E[T_i(n)] = m + (n - mk) P(A_{mk+1} = i) \le m + (n - mk) P\left(\hat\mu_i(mk) \ge \max_{j \ne i} \hat\mu_j(mk)\right). \tag{6.2}$$

The probability on the right-hand side is bounded by

$$P\left(\hat\mu_i(mk) \ge \max_{j \ne i} \hat\mu_j(mk)\right) \le P\left(\hat\mu_i(mk) \ge \hat\mu_1(mk)\right) = P\left(\hat\mu_i(mk) - \mu_i - (\hat\mu_1(mk) - \mu_1) \ge \Delta_i\right).$$

The next step is to check that $\hat\mu_i(mk) - \mu_i - (\hat\mu_1(mk) - \mu_1)$ is $\sqrt{2/m}$-subgaussian,
which by the properties of subgaussian random variables follows from the
definitions of $(\hat\mu_j)_j$ and the algorithm. Hence by Corollary 5.5,

$$P\left(\hat\mu_i(mk) - \mu_i - \hat\mu_1(mk) + \mu_1 \ge \Delta_i\right) \le \exp\left(-\frac{m\Delta_i^2}{4}\right). \tag{6.3}$$

Substituting Eq. (6.3) into Eq. (6.2) and the regret decomposition (Eq. (6.1))
gives the result.


The bound in Theorem 6.1 illustrates the trade-off between exploration and 
exploitation. If m is large, then the policy explores for too long, and the first 
term will be large. On the other hand, if m is too small, then the probability 
that the algorithm commits to the wrong arm will grow, and the second term 
becomes large. The question is how to choose m. Assume that k = 2 and that 
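the first arm is optimal so that $\Delta_1 = 0$, and abbreviate $\Delta = \Delta_2$. Then the bound
in Theorem 6.1 simplifies to

$$R_n \le m\Delta + (n - 2m)\Delta \exp\left(-\frac{m\Delta^2}{4}\right) \le m\Delta + n\Delta \exp\left(-\frac{m\Delta^2}{4}\right). \tag{6.4}$$

For large $n$ the quantity on the right-hand side of Eq. (6.4) is minimised up to a
possible rounding error by

$$m = \max\left\{1, \left\lceil \frac{4}{\Delta^2} \log\left(\frac{n\Delta^2}{4}\right) \right\rceil\right\}, \tag{6.5}$$

and for this choice and any $n$, the regret is bounded by

$$R_n \le \min\left\{n\Delta,\ \Delta + \frac{4}{\Delta}\left(1 + \max\left\{0, \log\left(\frac{n\Delta^2}{4}\right)\right\}\right)\right\}. \tag{6.6}$$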
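
To make the trade-off concrete, the sketch below evaluates the bound of Theorem 6.1, the exploration length in Eq. (6.5) and the right-hand side of Eq. (6.6). The function names are hypothetical and chosen for this illustration only; `gap` is assumed to be strictly positive, and `etc_regret_bound` assumes $1 \le m \le n/k$ as in the theorem.

```python
import math
from typing import Sequence


def etc_regret_bound(n: int, m: int, gaps: Sequence[float]) -> float:
    """Upper bound of Theorem 6.1 for suboptimality gaps Delta_1, ..., Delta_k."""
    k = len(gaps)
    assert 1 <= m <= n / k
    exploration = m * sum(gaps)
    commitment = (n - m * k) * sum(d * math.exp(-m * d * d / 4) for d in gaps)
    return exploration + commitment


def tuned_m(n: int, gap: float) -> int:
    """Exploration length of Eq. (6.5) for a two-armed bandit with gap > 0."""
    return max(1, math.ceil(4 / gap**2 * math.log(n * gap**2 / 4)))


def tuned_bound(n: int, gap: float) -> float:
    """Right-hand side of Eq. (6.6)."""
    log_term = max(0.0, math.log(n * gap**2 / 4))
    return min(n * gap, gap + 4 / gap * (1 + log_term))


if __name__ == "__main__":
    n, gap = 1000, 0.5
    m = tuned_m(n, gap)
    print(m, etc_regret_bound(n, m, [0.0, gap]), tuned_bound(n, gap))
```

When $n\Delta^2/4 \le 1$ the logarithm is non-positive and `tuned_m` returns 1, which matches the outer maximum in Eq. (6.5).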


In Exercise 6.2 you will show that Eq. (6.6) implies that

$$R_n \le \Delta + C\sqrt{n}, \tag{6.7}$$

where $C > 0$ is a universal constant. In particular, when $\Delta \le 1$ as is often
assumed, we get

$$R_n \le 1 + C\sqrt{n}.$$


Bounds of this type are called worst-case, problem free or problem 
independent (see Eq. (4.2) or Eq. (4.3)). The reason is that the bound only 
depends on the horizon and class of bandits for which the algorithm is designed, 
and not the specific instance within that class. Because the suboptimality gap does 
not appear, bounds like this are sometimes called gap-free. In contrast, bounds 
like the one in Eq. (6.6) are called gap/problem/distribution/instance 
dependent. 

Note that without the condition $\Delta \le 1$, the worst-case bound for ETC is
infinite. In fact, without a bound on the reward range, the worst-case bound of
all reasonable algorithms (that try each action at least once) will also be infinite.
With the understanding that Eq. (6.7) gives rise to a meaningful worst-case
bound for bandits with bounded reward range, we take the liberty of also
calling bounds like that in Eq. (6.7) worst-case bounds.


The bound in (6.6) is close to optimal (see Part IV), but there is a caveat. The 
choice of m that defines the policy and leads to this bound depends on both the 
suboptimality gap and the horizon. While the horizon is sometimes known in 
advance, it is seldom reasonable to assume knowledge of the suboptimality gap. 
You will show in Exercise 6.5 that there is a choice of $m$ depending only on $n$, for
which $R_n = O(n^{2/3})$ regardless of the value of $\Delta$. Alternatively, the number of
plays before commitment can be made data dependent, which means the learner 
plays arms alternately until it decides based on its observations to commit to 
a single arm for the remainder (Exercise 6.5). ETC also has the property that 
its immediate expected regret per time step is monotonically decreasing as time 
goes by, though not in a nice smooth fashion. This monotone decreasing property 
is a highly desirable property. In later chapters we will see policies where the 
decrease is smoother. 


EXPERIMENT 6.1 Fig. 6.1 shows the expected regret of ETC when playing a
Gaussian bandit with $k = 2$ and means $\mu_1 = 0$ and $\mu_2 = -\Delta$. The horizon is set
to $n = 1000$, and the suboptimality gap $\Delta$ is varied between 0 and 1. Each data
point is the average of $10^5$ simulations, which makes the error bars invisible. The
results show that the theoretical upper bound provided by Theorem 6.1 is quite
close to the actual performance.


Figure 6.1 The expected regret of ETC and the upper bound in Eq. (6.6). [Plot omitted: the vertical axis is the expected regret; the two curves are the upper bound in (6.6) and ETC with $m$ in (6.5).]


6.2 Notes


1 An algorithm is called anytime if it does not require advance knowledge of 
the horizon n. ETC is not anytime because the choice of commitment time 
depends on the horizon. This limitation can be addressed by the doubling 
trick, which is a simple way to convert a horizon-dependent algorithm into 
an anytime algorithm (Exercise 6.6). 

2 By allowing the exploration time m to be a data-dependent random variable, 
it is possible to recover near-optimal regret without knowing the suboptimality 
gap. For more details see Exercise 6.5. Another idea is to use an elimination 
algorithm that acts in phases and eliminates arms using increasingly sensitive 
hypothesis tests (Exercise 6.8). Elimination algorithms are often easy to analyse 
and can work well in practice, but they also have inherent limitations, just like 
ETC algorithms, as will be commented on later. 

3 The $\varepsilon$-greedy algorithm is a randomised relative of ETC that in round $t$
plays the empirically best arm with probability $1 - \varepsilon_t$ and otherwise explores
uniformly at random. You will analyse this algorithm in Exercise 6.7.
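
The $\varepsilon$-greedy rule from the last note can be sketched as below. The names `epsilon_greedy`, `eps` (a function mapping the round $t$ to $\varepsilon_t$) and `pull` are assumptions made for this illustration. Playing each arm once before applying the rule follows the description in Exercise 6.7, and ties among empirically best arms go to the lowest index.

```python
import random
from typing import Callable, List


def epsilon_greedy(n: int, k: int, eps: Callable[[int], float],
                   pull: Callable[[int], float],
                   rng: random.Random) -> List[int]:
    """Run epsilon-greedy for n rounds on k arms and return the arms played."""
    counts = [0] * k
    sums = [0.0] * k
    actions: List[int] = []
    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1                      # initial pass: play every arm once
        elif rng.random() < eps(t):
            arm = rng.randrange(k)           # explore uniformly at random
        else:
            # exploit the arm with the largest empirical mean so far
            arm = max(range(k), key=lambda i: sums[i] / counts[i])
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        actions.append(arm)
    return actions


if __name__ == "__main__":
    # Hypothetical Gaussian bandit and a decaying exploration schedule.
    means = [0.0, -0.3, -0.6]
    rng = random.Random(2)
    played = epsilon_greedy(2000, 3, lambda t: min(1.0, 50.0 / t),
                            lambda i: rng.gauss(means[i], 1.0), rng)
    print([played.count(i) for i in range(3)])
```

Passing the random generator explicitly keeps runs reproducible, which is convenient when averaging regret over many simulations.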


6.3 Bibliographical Remarks


ETC has a long history. Robbins [1952] considered ‘certainty equivalence with 
forcing’, which chooses the arm with the largest sample mean except at a fixed 
set of times T; C N when arm i is chosen for i € [k]. By choosing the set 
of times carefully, it is shown that this policy enjoys sublinear regret. While 
ETC performs all the exploration at the beginning, Robbins’s policy spreads 
the exploration over time. This is advantageous if the horizon is not known, 
but disadvantageous otherwise. Anscombe [1963] considered exploration and 
commitment in the context of medical trials or other experimental set-ups. He 
already largely solves the problem in the Gaussian case and highlights many of 
the important considerations. Besides this, the article is beautifully written and 
well worth reading. Strategies based on exploration and commitment are simple 
to implement and analyse. They can also generalise well to more complex settings. 
For example, Langford and Zhang [2008] consider this style of policy under the 
name ‘epoch-greedy’ for contextual bandits (the idea of exploring then exploiting 
in epochs, or intervals, is essentially what Robbins [1952] suggested). We’ll return 
to contextual bandits in Chapter 18. Abbasi-Yadkori et al. [2009], Abbasi-Yadkori
[2009b] and Rusmevichientong and Tsitsiklis [2010] consider ETC-style policies 
under the respective names of ‘forced exploration’ and ‘phased exploration and 
greedy exploitation’ (PEGE) in the context of linear bandits (which we shall meet 
in Chapter 19). Other names include ‘forced sampling’, ‘explore-first’, ‘explore- 
then-exploit’. Garivier et al. [2016b] have shown that ETC policies are necessarily 
suboptimal in the limit of infinite data in a way that is made precise in Chapter 16. 
This comment also applies to elimination-based strategies, which are described in
Exercise 6.8. The history of $\varepsilon$-greedy is unclear, but it is a popular and widely
used algorithm in reinforcement learning [Sutton and Barto, 1998]. Auer
et al. [2002a] analyse the regret of $\varepsilon$-greedy with slowly decreasing exploration
probabilities. There are other kinds of randomised exploration as well, including
Thompson sampling [1933] and Boltzmann exploration analysed recently by
Cesa-Bianchi et al. [2017].


6.4 Exercises


6.1 (SUBGAUSSIAN EMPIRICAL ESTIMATES) Let $\pi$ be the policy of ETC and
$P_1, \ldots, P_k$ be the 1-subgaussian distributions associated with the $k$ arms. Provide
a fully rigorous proof of the claim that

$$\hat\mu_i(mk) - \mu_i - \hat\mu_1(mk) + \mu_1$$

is $\sqrt{2/m}$-subgaussian. You should only use the definitions and the interaction
protocol, which states that

(a) $P(A_t \in \cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}) = \pi(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$ a.s.
(b) $P(X_t \in \cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}, A_t) = P_{A_t}(\cdot)$ a.s.


6.2 (MINIMAX REGRET) Show that Eq. (6.6) implies the regret of an optimally
tuned ETC for subgaussian two-armed bandits satisfies $R_n \le \Delta + C\sqrt{n}$, where
$C > 0$ is a universal constant.


6.3 (HIGH-PROBABILITY BOUNDS (I)) Assume that $k = 2$, and let $\delta \in (0,1)$.
Modify the ETC algorithm to depend on $\delta$ and prove a bound on the
pseudo-regret $\bar R_n = n\mu^* - \sum_{t=1}^n \mu_{A_t}$ of ETC that holds with probability $1 - \delta$. The
algorithm is allowed to use the action suboptimality gaps.


6.4 (HIGH-PROBABILITY BOUNDS (II)) Repeat the previous exercise, but now
prove a high probability bound on the random regret: $\hat R_n = n\mu^* - \sum_{t=1}^n X_t$.
Compare this to the bound derived for the pseudo-regret in the previous exercise.
What can you conclude?


6.5 (ADAPTIVE COMMITMENT TIMES) Suppose that ETC interacts with a two-armed
1-subgaussian bandit $\nu \in \mathcal{E}$ with means $\mu_1, \mu_2 \in \mathbb{R}$ and $\Delta_\nu = |\mu_1 - \mu_2|$.

(a) Find a choice of $m$ that only depends on the horizon $n$ and not $\Delta$ such that
there exists a constant $C > 0$ such that for any $n$ and for any $\nu \in \mathcal{E}$, the
regret $R_n(\nu)$ of Algorithm 1 is bounded by

$$R_n(\nu) \le (\Delta_\nu + C) n^{2/3}.$$

Furthermore, show that there is no $C > 0$ such that for any problem instance
$\nu$ and $n \ge 1$, $R_n(\nu) \le \Delta_\nu + C n^{2/3}$ holds.

(b) Now suppose the commitment time is allowed to be data dependent, which
means the algorithm explores each arm alternately until some condition is
met and then commits to a single arm for the remainder. Design a condition
such that the regret of the resulting algorithm can be bounded by

$$R_n(\nu) \le \Delta_\nu + \frac{C \log n}{\Delta_\nu}, \tag{6.8}$$

where $C$ is a universal constant. Your condition should only depend on the
observed rewards and the time horizon. It should not depend on $\mu_1$, $\mu_2$ or
$\Delta_\nu$.

(c) Show that any algorithm for which (6.8) holds also satisfies $R_n(\nu) \le
\Delta_\nu + C\sqrt{n \log(n)}$ for any $n \ge 1$ and $\nu \in \mathcal{E}$ and a suitably chosen universal
constant $C > 0$.

(d) As for (b), but now the objective is to design a condition such that for any
$n \ge 1$ and $\nu \in \mathcal{E}$, the regret of the resulting algorithm is bounded by

$$R_n(\nu) \le \Delta_\nu + \frac{C \log \max\{e, n\Delta_\nu^2\}}{\Delta_\nu}. \tag{6.9}$$

(e) Show that any algorithm for which (6.9) holds also satisfies that for any
$n \ge 1$ and $\nu \in \mathcal{E}$, $R_n(\nu) \le \Delta_\nu + C\sqrt{n}$ for a suitably chosen universal constant
$C > 0$.


Hint For (a) start from $R_n \le m\Delta + n\Delta \exp(-m\Delta^2/4)$, which is the right-hand
side of Eq. (6.4), and show an upper bound on the second term which is
independent of $\Delta$. Then, choose $m$. For
(b) think about the simplest stopping policy and then make it robust by using
confidence intervals. Tune the failure probability. For (c) note that the regret
can never be larger than $n\Delta$.


6.6 (DOUBLING TRICK) The purpose of this exercise is to analyse a
meta-algorithm based on the so-called doubling trick that converts a policy depending
on the horizon to a policy with similar guarantees that does not. Let $\mathcal{E}$ be an
arbitrary set of bandits. Suppose you are given a policy $\pi = \pi(n)$ designed for $\mathcal{E}$
that accepts the horizon $n$ as a parameter and has a regret guarantee of

$$\max_{1 \le t \le n} R_t(\pi(n), \nu) \le f_n(\nu), \quad \nu \in \mathcal{E},$$

where $f_n : \mathcal{E} \to [0,\infty)$ is a sequence of functions. Let $n_1 < n_2 < n_3 < \cdots$ be a
fixed sequence of integers and consider the policy that runs $\pi$ with horizon $n_1$
until round $t = \min\{n, n_1\}$, then runs $\pi$ with horizon $n_2$ until $t = \min\{n, n_1 + n_2\}$,
and then restarts again with horizon $n_3$ until $t = \min\{n, n_1 + n_2 + n_3\}$ and so on.
Note that $t$ is the real-time counter and is not reset on each restart. Let $\pi^*$ be the
resulting policy. When $n_{\ell+1} = 2n_\ell$, the length of the periods when $\pi$ is used doubles
with each phase, hence the name ‘doubling trick’.


(a) Let $n > 0$ be arbitrary and $\ell_{\max} = \min\{\ell : \sum_{i=1}^\ell n_i \ge n\}$. Prove that for any
$\nu \in \mathcal{E}$, the $n$-horizon regret of $\pi^*$ on $\nu$ is at most

$$R_n(\pi^*, \nu) \le \sum_{\ell=1}^{\ell_{\max}} f_{n_\ell}(\nu). \tag{6.10}$$

(b) Suppose that $f_n(\nu) \le \sqrt{n}$. Show that if $n_\ell = 2^{\ell-1}$, then for any $\nu \in \mathcal{E}$ and
horizon $n$ the regret of $\pi^*$ is at most

$$R_n(\pi^*, \nu) \le C\sqrt{n},$$

where $C > 0$ is a carefully chosen universal constant.

(c) Suppose that $f_n(\nu) = g(\nu)\log(n)$ for some function $g : \mathcal{E} \to [0, \infty)$. What is
the regret of $\pi^*$ if $n_\ell = 2^{\ell-1}$? Can you find a better choice of $(n_\ell)_\ell$?

(d) In light of this idea, should we bother trying to design algorithms that do not 
depend on the horizon? Are there any disadvantages to using the doubling 
trick? If so, what are they? Write a short summary of the pros and cons of 
the doubling trick. 


According to Besson and Kaufmann [2018], the doubling trick was first 
applied to bandits by Auer et al. [1995]. Note, nowhere in this exercise did 
we use that the bandit is stochastic. Nothing changes in the adversarial or 
contextual settings studied later in the book. 
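
The restart schedule of the doubling trick in Exercise 6.6 can be sketched as follows for the choice $n_\ell = 2^{\ell-1}$. The generator name `doubling_phases` is a hypothetical helper introduced for this illustration; it only produces the phase boundaries and leaves running the horizon-dependent policy inside each phase to the caller.

```python
from typing import Iterator, Tuple


def doubling_phases(n: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (horizon, first_round, last_round) for each restart up to round n.

    Phase l uses horizon n_l = 2**(l - 1); rounds are numbered from 1 and the
    real-time counter is never reset. The final phase is cut off at round n.
    """
    start, horizon = 1, 1
    while start <= n:
        end = min(n, start + horizon - 1)
        yield horizon, start, end
        start = end + 1
        horizon *= 2


if __name__ == "__main__":
    for horizon, first, last in doubling_phases(10):
        print(f"run the base policy with horizon {horizon} on rounds {first}..{last}")
```

The number of phases produced equals $\ell_{\max}$ from Part (a), which is what makes the sum in Eq. (6.10) the natural quantity to bound.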


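6.7 ($\varepsilon$-GREEDY) For this exercise assume the rewards are 1-subgaussian and
there are $k \ge 2$ arms. The $\varepsilon$-greedy algorithm depends on a sequence of
parameters $\varepsilon_1, \varepsilon_2, \ldots$. First it chooses each arm once and subsequently chooses
$A_t = \operatorname{argmax}_i \hat\mu_i(t-1)$ with probability $1 - \varepsilon_t$ and otherwise chooses an arm
uniformly at random.

(a) Prove that if $\varepsilon_t = \varepsilon > 0$, then

$$\lim_{n \to \infty} \frac{R_n}{n} = \frac{\varepsilon}{k} \sum_{i=1}^k \Delta_i.$$

(b) Let $\Delta_{\min} = \min\{\Delta_i : \Delta_i > 0\}$ and let $\varepsilon_t = \min\left\{1, \frac{Ck}{t\Delta_{\min}^2}\right\}$, where $C > 0$ is
a sufficiently large universal constant. Prove that there exists a universal
$C' > 0$ such that

$$R_n \le C' \sum_{i=1}^k \left(\Delta_i + \frac{\Delta_i}{\Delta_{\min}^2} \log\max\left\{e, \frac{n\Delta_{\min}^2}{k}\right\}\right).$$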


6.8 (ELIMINATION ALGORITHM) A simple way to generalise the ETC policy to 
multiple arms and overcome the problem of tuning the commitment time is to 
use an elimination algorithm. The algorithm operates in phases and maintains 
an active set of arms that could be optimal. In the $\ell$th phase, the algorithm aims
to eliminate from the active set all arms $i$ for which $\Delta_i \ge 2^{-\ell}$.

Without loss of generality, assume that arm 1 is an optimal arm. You may
assume that the horizon $n$ is known.

1: Input: $k$ and sequence $(m_\ell)_\ell$
2: $\mathcal{A}_1 = \{1, 2, \ldots, k\}$
3: for $\ell = 1, 2, 3, \ldots$ do
4: Choose each arm $i \in \mathcal{A}_\ell$ exactly $m_\ell$ times
5: Let $\hat\mu_{i,\ell}$ be the average reward for arm $i$ from this phase only
6: Update active set: $\mathcal{A}_{\ell+1} = \left\{i : \hat\mu_{i,\ell} + 2^{-\ell} \ge \max_{j \in \mathcal{A}_\ell} \hat\mu_{j,\ell}\right\}$
7: end for

Algorithm 2: Phased elimination for finite-armed bandits

(a) Show that for any $\ell \ge 1$,

$$P(1 \notin \mathcal{A}_{\ell+1}, 1 \in \mathcal{A}_\ell) \le k \exp\left(-\frac{m_\ell 2^{-2\ell}}{4}\right).$$

(b) Show that if $i \in [k]$ and $\ell \ge 1$ are such that $\Delta_i \ge 2^{-\ell}$, then

$$P(i \in \mathcal{A}_{\ell+1}, 1 \in \mathcal{A}_\ell, i \in \mathcal{A}_\ell) \le \exp\left(-\frac{m_\ell (\Delta_i - 2^{-\ell})^2}{4}\right).$$

(c) Let $\ell_i = \min\{\ell \ge 1 : 2^{-\ell} \le \Delta_i/2\}$. Choose $m_\ell$ in such a way that
$P(\text{exists } \ell : 1 \notin \mathcal{A}_\ell) \le 1/n$ and $P(i \in \mathcal{A}_{\ell_i+1}) \le 1/n$.

(d) Show that your algorithm has regret at most

$$R_n \le C \sum_{i : \Delta_i > 0} \left(\Delta_i + \frac{1}{\Delta_i}\log(n)\right),$$

where $C > 0$ is a carefully chosen universal constant.

(e) Modify your choice of $m_\ell$ and show that the regret of the resulting algorithm
satisfies

$$R_n \le C \sum_{i : \Delta_i > 0} \left(\Delta_i + \frac{1}{\Delta_i}\log\max\{e, n\Delta_i^2\}\right).$$

(f) Show that with an appropriate universal constant $C' > 0$, the regret satisfies

$$R_n \le \sum_i \Delta_i + C' \sqrt{nk\log(k)}.$$


i 


Algorithm 2 is due to Auer and Ortner [2010]. The $\log(k)$ term in Part (f) can
be removed by modifying the algorithm to use the refined confidence intervals
in Chapter 9, but we would not recommend this for the reasons discussed
in Section 9.2 of that chapter. You could also use a more sophisticated
confidence level [Lattimore, 2018].
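
For readers who want to experiment with Algorithm 2, a minimal Python sketch is given below. The function name `phased_elimination`, the `pull` callback and the schedule `m` (a function from the phase index $\ell$ to $m_\ell$) are assumptions introduced here, and arms are 0-indexed. Stopping as soon as the horizon $n$ is reached, even in the middle of a phase, is a choice made for this sketch; Algorithm 2 itself does not specify it, and choosing $m_\ell$ is the subject of Parts (c) and (e).

```python
from typing import Callable, Dict, List


def phased_elimination(n: int, k: int, m: Callable[[int], int],
                       pull: Callable[[int], float]) -> List[int]:
    """Run phased elimination for at most n rounds and return the final active set."""
    active = list(range(k))
    t, phase = 0, 1
    while t < n:
        plays = m(phase)
        assert plays >= 1, "each phase must play every active arm at least once"
        means: Dict[int, float] = {}
        for i in active:
            total = 0.0
            for _ in range(plays):
                if t == n:
                    return active          # horizon reached before the phase ended
                total += pull(i)
                t += 1
            means[i] = total / plays       # average from this phase only
        best = max(means.values())
        # keep arm i if its phase average is within 2^{-phase} of the best one
        active = [i for i in active if means[i] + 2.0 ** (-phase) >= best]
        phase += 1
    return active
```

The arm with the largest phase average always survives the update, so the active set never becomes empty.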
Figure 6.2 Expected regret for ETC over $10^5$ trials on a Gaussian bandit with means
$\mu_1 = 0$, $\mu_2 = -1/10$


6.9 (EMPIRICAL STUDY) In this exercise you will investigate the empirical
behaviour of ETC on a two-armed Gaussian bandit with means $\mu_1 = 0$ and
$\mu_2 = -\Delta$. Let

$$\bar R_n = \sum_{t=1}^n \Delta_{A_t},$$

which is chosen so that $R_n = E[\bar R_n]$. Complete the following:

(a) Using a programming language of your choice, write a function that accepts
an integer $n$ and $\Delta > 0$ and returns the value of $m$ that exactly minimises
the expected regret.

(b) Reproduce Fig. 6.1.

(c) Fix $\Delta = 1/10$ and plot the expected regret as a function of $m$ with $n = 2000$.
Your plot should resemble Fig. 6.2.

(d) Plot the standard deviation $V[\bar R_n]^{1/2}$ as a function of $m$ for the same bandit
as above. Your plot should resemble Fig. 6.3.

(e) Explain the shape of the curves you observed in Parts (b), (c) and (d) and
reconcile what you see with the theoretical results.

(f) Think, experiment and plot. Is it justified to plot $V[\bar R_n]^{1/2}$ as a summary of
how $\bar R_n$ is distributed? Explain your thinking.
Figure 6.3 Standard deviation of the regret for ETC over $10^5$ trials on a Gaussian bandit
with means $\mu_1 = 0$, $\mu_2 = -1/10$


7 The Upper Confidence Bound Algorithm


The upper confidence bound (UCB) algorithm offers several advantages over the
explore-then-commit (ETC) algorithm introduced in the last chapter.


(a) It does not depend on advance knowledge of the suboptimality gaps.

(b) It behaves well when there are more than two arms.

(c) The version introduced here depends on the horizon $n$, but in the next
chapter, we will see how to eliminate that as well.


The algorithm has many different forms, depending on the distributional
assumptions on the noise. Like in the previous chapter, we assume the noise is
1-subgaussian. A serious discussion of other options is delayed until Chapter 10.


7.1 The Optimism Principle


The UCB algorithm is based on the principle of optimism in the face of 
uncertainty, which states that one should act as if the environment is as nice as 
plausibly possible. As we shall see in later chapters, the principle is applicable 
beyond the finite-armed stochastic bandit problem. 

Imagine visiting a new country and making a choice between sampling the local 
cuisine or visiting a well-known multinational chain. Taking an optimistic view of 
the unknown local cuisine leads to exploration because without data, it could be 
amazing. After trying the new option a few times, you can update your statistics 
and make a more informed decision. On the other hand, taking a pessimistic 
view of the new option discourages exploration, and you may suffer significant 
regret if the local options are delicious. Just how optimistic you should be is a 
difficult decision, which we explore for the rest of the chapter in the context of 
finite-armed bandits. 

For bandits, the optimism principle means using the data observed so far to 
assign to each arm a value, called the upper confidence bound that with high 
probability is an overestimate of the unknown mean. The intuitive reason why 
this leads to sublinear regret is simple. Assuming the upper confidence bound 
assigned to the optimal arm is indeed an overestimate, then another arm can only 
be played if its upper confidence bound is larger than that of the optimal arm, 
which in turn is larger than the mean of the optimal arm. And yet this cannot 
happen too often because the additional data provided by playing a suboptimal
arm means that the upper confidence bound for this arm will eventually fall 
below that of the optimal arm. 

In order to make this argument more precise, we need to define the upper
confidence bound. Let $(X_t)_{t=1}^n$ be a sequence of independent 1-subgaussian random
variables with mean $\mu$ and $\hat\mu = \frac{1}{n}\sum_{t=1}^n X_t$. By Eq. (5.6),

$$P\left(\mu \ge \hat\mu + \sqrt{\frac{2\log(1/\delta)}{n}}\right) \le \delta \quad \text{for all } \delta \in (0,1). \tag{7.1}$$

When considering its options in round $t$, the learner has observed $T_i(t-1)$
samples from arm $i$ and received rewards from that arm with an empirical mean
of $\hat\mu_i(t-1)$. Then a reasonable candidate for ‘as large as plausibly possible’ for
the unknown mean of the $i$th arm is

$$\mathrm{UCB}_i(t-1, \delta) = \begin{cases} \infty & \text{if } T_i(t-1) = 0; \\ \hat\mu_i(t-1) + \sqrt{\dfrac{2\log(1/\delta)}{T_i(t-1)}} & \text{otherwise}. \end{cases} \tag{7.2}$$

Great care is required when comparing (7.1) and (7.2) because in the former the
number of samples is the constant $n$, but in the latter it is a random variable
$T_i(t-1)$. By and large, however, this is merely an annoying technicality, and the
intuition remains that $\delta$ is approximately an upper bound on the probability of
the event that the above quantity is an underestimate of the true mean. More
details are given in Exercise 7.1.
At last we have everything we need to state a version of the UCB algorithm, 
which takes as input the number of arms and the error probability 6. 


1: Input $k$ and $\delta$
2: for $t \in 1, \ldots, n$ do
3:     Choose action $A_t = \operatorname{argmax}_i \mathrm{UCB}_i(t-1, \delta)$
4:     Observe reward $X_t$ and update upper confidence bounds
5: end for

Algorithm 3: UCB($\delta$).
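
Algorithm 3 is short enough to translate almost line by line. The following is a minimal Python sketch rather than a reference implementation: the names `ucb_index`, `run_ucb` and the reward callback `sample_reward` are assumptions introduced here for illustration, and ties in the argmax are broken in favour of the lowest arm number, a choice the algorithm itself leaves open.

```python
import math
import random


def ucb_index(mean, count, delta):
    """Index from Eq. (7.2): infinite for an arm that has not been played."""
    if count == 0:
        return math.inf
    return mean + math.sqrt(2.0 * math.log(1.0 / delta) / count)


def run_ucb(sample_reward, k, n, delta):
    """Run UCB(delta) for n rounds on a k-armed bandit.

    sample_reward(i) is a hypothetical callback returning one reward
    from arm i. Returns the chosen arms and the final play counts.
    """
    counts = [0] * k      # T_i(t - 1)
    means = [0.0] * k     # empirical means
    actions = []
    for _ in range(n):
        indices = [ucb_index(means[i], counts[i], delta) for i in range(k)]
        arm = max(range(k), key=lambda i: indices[i])  # ties: lowest arm
        reward = sample_reward(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
        actions.append(arm)
    return actions, counts


if __name__ == "__main__":
    rng = random.Random(0)
    true_means = [0.0, -0.5]  # illustrative two-armed Gaussian instance
    n = 1000
    _, counts = run_ucb(
        lambda i: rng.gauss(true_means[i], 1.0), k=2, n=n, delta=1.0 / n**2
    )
    print(counts)
```

Because an unplayed arm has an infinite index, the first $k$ rounds play each arm once without any special-casing.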


Although there are many versions of the UCB algorithm, we often do not 
distinguish them by name and hope the context is clear. For the rest of this 
chapter, we’ll usually call UCB($\delta$) just UCB.


The value inside the argmax is called the index of arm i. Generally speaking, 
an index algorithm chooses the arm in each round that maximises some value 
(the index), which usually only depends on the current time step and the samples 
from that arm. In the case of UCB, the index is the sum of the empirical mean
of rewards experienced so far and the exploration bonus, which is also known
as the confidence width. 

Besides the slightly vague ‘optimism guarantees optimality or learning’ intuition 
we gave before, it is worth exploring other intuitions for the choice of index. At 
a very basic level, an algorithm should explore arms more often if they are (a)
promising because $\hat\mu_i(t-1)$ is large or (b) not well explored because $T_i(t-1)$ is
small. As one can plainly see, the definition in Eq. (7.2) exhibits this behaviour.
This explanation is not completely satisfying, however, because it does not explain 
why the form of the functions is just so. 

A more refined explanation comes from thinking of what we expect of any 
reasonable algorithm. Suppose at the start of round t the first arm has been 
played much more frequently than the rest. If we did a good job designing our 
algorithm, we would hope this is the optimal arm, and because it has been played 
so often, we expect that $\hat\mu_1(t-1) \approx \mu_1$. To confirm the hypothesis that arm 1 is
optimal, the algorithm had better be highly confident that other arms are indeed 
worse. This leads quite naturally to the idea of using upper confidence bounds. 
The learner can be reasonably certain that arm i is worse than arm 1 if 


$$\hat\mu_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}} \le \mu_1 \approx \hat\mu_1(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_1(t-1)}}, \tag{7.3}$$

where $\delta$ is called the confidence level and quantifies the degree of certainty.
This means that choosing the arm with the largest upper confidence bound leads 
to a situation where arms are only chosen if their true mean could reasonably be 
larger than those of arms that have been played often. That this rule is indeed a 
good one depends on two factors. The first is whether the width of the confidence 
interval at a given confidence level can be significantly decreased, and the second 
is whether the confidence level is chosen in a reasonable fashion. For now, we 
will take a leap of faith and assume that the width of confidence intervals for 
subgaussian bandits cannot be significantly improved from what we use here 
(we shall see that this holds in later chapters), and concentrate on choosing the 
confidence level now. 


Choosing the confidence level is a delicate problem, and we will analyse a
number of choices in future chapters. The basic difficulty is that $\delta$ should
be small enough to ensure optimism with high probability, but not so small
that suboptimal arms are explored excessively.


Nevertheless, as a first cut, the choice of this parameter can be guided by 
the following considerations. If the confidence interval fails and the index of an 
optimal arm drops below its true mean, then it could happen that the algorithm 
stops playing the optimal arm and suffers linear regret. This suggests we might 
choose $\delta \approx 1/n$ so that the contribution to the regret of this failure case is
relatively small. Unfortunately things are not quite this simple. As we have
already alluded to, one of the main difficulties is that the number of samples
$T_i(t-1)$ in the index (7.2) is a random variable, and so our concentration results
cannot be immediately applied. For this reason we will see that (at least naively)
$\delta$ should be chosen a bit smaller than $1/n$.


THEOREM 7.1. Consider UCB as shown in Algorithm 3 on a stochastic $k$-armed
1-subgaussian bandit problem. For any horizon $n$, if $\delta = 1/n^2$, then

$$R_n \le 3\sum_{i=1}^k \Delta_i + \sum_{i:\Delta_i>0} \frac{16\log(n)}{\Delta_i}.$$

Before the proof we need a little more notation. Let $(X_{ti})_{t\in[n], i\in[k]}$ be a collection
of independent random variables with the law of $X_{ti}$ equal to $P_i$. Then define
$\hat\mu_{is} = \frac{1}{s}\sum_{u=1}^{s} X_{ui}$ to be the empirical mean based on the first $s$ samples. We
make use of the third model in Section 4.6 by assuming that the reward in round
$t$ is

$$X_t = X_{T_{A_t}(t) A_t}.$$

Then we define $\hat\mu_i(t) = \hat\mu_{iT_i(t)}$ to be the empirical mean of the $i$th arm after round
$t$. The proof of Theorem 7.1 relies on the basic regret decomposition identity,

$$R_n = \sum_{i=1}^k \Delta_i \mathbb{E}[T_i(n)]. \qquad \text{(Lemma 4.5)}$$


The theorem will follow by showing that $\mathbb{E}[T_i(n)]$ is not too large for suboptimal
arms $i$. The key observation is that after the initial period where the algorithm
chooses each action once, action $i$ can only be chosen if its index is higher than
that of an optimal arm. This can only happen if at least one of the following is
true:


(a) The index of action i is larger than the true mean of a specific optimal arm. 
(b) The index of a specific optimal arm is smaller than its true mean. 


Since with reasonably high probability the index of any arm is an upper bound 
on its mean, we don’t expect the index of the optimal arm to be below its 
mean. Furthermore, if the suboptimal arm $i$ is played sufficiently often, then its
exploration bonus becomes small and simultaneously the empirical estimate of 
its mean converges to the true value, putting an upper bound on the expected 
total number of times when its index stays above the mean of the optimal arm. 
The proof that follows is typical for the analysis of algorithms like UCB, and 
hence we provide quite a bit of detail so that readers can later construct their 
own proofs. 


Proof of Theorem 7.1 Without loss of generality, we assume the first arm is
optimal so that $\mu_1 = \mu^*$. As noted above,

$$R_n = \sum_{i=1}^k \Delta_i \mathbb{E}[T_i(n)]. \tag{7.4}$$

The theorem will be proven by bounding $\mathbb{E}[T_i(n)]$ for each suboptimal arm $i$. We
make use of a relatively standard idea, which is to decouple the randomness from
the behaviour of the UCB algorithm. Let $G_i$ be the ‘good’ event defined by


$$G_i = \left\{\mu_1 < \min_{t\in[n]} \mathrm{UCB}_1(t, \delta)\right\} \cap \left\{\hat\mu_{iu_i} + \sqrt{\frac{2}{u_i}\log\left(\frac{1}{\delta}\right)} < \mu_1\right\},$$

where $u_i \in [n]$ is a constant to be chosen later. So $G_i$ is the event when $\mu_1$ is
never underestimated by the upper confidence bound of the first arm, while at the
same time the upper confidence bound for the mean of arm $i$ after $u_i$ observations
are taken from this arm is below the pay-off of the optimal arm. We will show
two things:


1 If $G_i$ occurs, then arm $i$ will be played at most $u_i$ times: $T_i(n) \le u_i$.

2 The complement event $G_i^c$ occurs with low probability (governed in some way
yet to be discovered by $u_i$).


Because $T_i(n) \le n$ no matter what, this will mean that

$$\mathbb{E}[T_i(n)] = \mathbb{E}[\mathbb{I}\{G_i\}\, T_i(n)] + \mathbb{E}[\mathbb{I}\{G_i^c\}\, T_i(n)] \le u_i + \mathbb{P}(G_i^c)\, n. \tag{7.5}$$


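The next step is to complete our promise by showing that $T_i(n) \le u_i$ on $G_i$ and
that $\mathbb{P}(G_i^c)$ is small. Let us first assume that $G_i$ holds and show that $T_i(n) \le u_i$,
which we do by contradiction. Suppose that $T_i(n) > u_i$. Then arm $i$ was played
more than $u_i$ times over the $n$ rounds, and so there must exist a round $t \in [n]$
where $T_i(t-1) = u_i$ and $A_t = i$. Using the definition of $G_i$,

$$\begin{aligned}
\mathrm{UCB}_i(t-1, \delta) &= \hat\mu_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}} && \text{(definition of } \mathrm{UCB}_i(t-1, \delta)\text{)} \\
&= \hat\mu_{iu_i} + \sqrt{\frac{2\log(1/\delta)}{u_i}} && \text{(since } T_i(t-1) = u_i\text{)} \\
&< \mu_1 && \text{(definition of } G_i\text{)} \\
&< \mathrm{UCB}_1(t-1, \delta). && \text{(definition of } G_i\text{)}
\end{aligned}$$

Hence $A_t = \operatorname{argmax}_j \mathrm{UCB}_j(t-1, \delta) \ne i$, which is a contradiction. Therefore if
$G_i$ occurs, then $T_i(n) \le u_i$. Let us now turn to upper bounding $\mathbb{P}(G_i^c)$. By its
definition,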


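$$G_i^c = \left\{\mu_1 \ge \min_{t\in[n]} \mathrm{UCB}_1(t, \delta)\right\} \cup \left\{\hat\mu_{iu_i} + \sqrt{\frac{2\log(1/\delta)}{u_i}} \ge \mu_1\right\}. \tag{7.6}$$

The first of these sets is decomposed using the definition of $\mathrm{UCB}_1(t, \delta)$,

$$\begin{aligned}
\left\{\mu_1 \ge \min_{t\in[n]} \mathrm{UCB}_1(t, \delta)\right\} &\subset \left\{\mu_1 \ge \min_{s\in[n]} \hat\mu_{1s} + \sqrt{\frac{2\log(1/\delta)}{s}}\right\} \\
&= \bigcup_{s\in[n]} \left\{\mu_1 \ge \hat\mu_{1s} + \sqrt{\frac{2\log(1/\delta)}{s}}\right\}.
\end{aligned}$$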


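Then using a union bound and the concentration bound for sums of independent
subgaussian random variables in Corollary 5.5, we obtain:

$$\begin{aligned}
\mathbb{P}\left(\mu_1 \ge \min_{t\in[n]} \mathrm{UCB}_1(t, \delta)\right) &\le \mathbb{P}\left(\bigcup_{s\in[n]} \left\{\mu_1 \ge \hat\mu_{1s} + \sqrt{\frac{2\log(1/\delta)}{s}}\right\}\right) \\
&\le \sum_{s=1}^n \mathbb{P}\left(\mu_1 \ge \hat\mu_{1s} + \sqrt{\frac{2\log(1/\delta)}{s}}\right) \le n\delta.
\end{aligned} \tag{7.7}$$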

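The next step is to bound the probability of the second set in (7.6). Assume that
$u_i$ is chosen large enough that

$$\Delta_i - \sqrt{\frac{2\log(1/\delta)}{u_i}} \ge c\Delta_i \tag{7.8}$$

for some $c \in (0,1)$ to be chosen later. Then, since $\mu_1 = \mu_i + \Delta_i$, and using
Corollary 5.5,

$$\begin{aligned}
\mathbb{P}\left(\hat\mu_{iu_i} + \sqrt{\frac{2\log(1/\delta)}{u_i}} \ge \mu_1\right) &= \mathbb{P}\left(\hat\mu_{iu_i} - \mu_i \ge \Delta_i - \sqrt{\frac{2\log(1/\delta)}{u_i}}\right) \\
&\le \mathbb{P}\left(\hat\mu_{iu_i} - \mu_i \ge c\Delta_i\right) \le \exp\left(-\frac{u_i c^2\Delta_i^2}{2}\right).
\end{aligned}$$

Taking this together with (7.7) and (7.6), we have

$$\mathbb{P}(G_i^c) \le n\delta + \exp\left(-\frac{u_i c^2\Delta_i^2}{2}\right).$$

When substituted into Eq. (7.5), we obtain

$$\mathbb{E}[T_i(n)] \le u_i + n\left(n\delta + \exp\left(-\frac{u_i c^2\Delta_i^2}{2}\right)\right). \tag{7.9}$$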


It remains to choose $u_i \in [n]$ satisfying (7.8). A natural choice is the smallest
integer for which (7.8) holds, which is

$$u_i = \left\lceil \frac{2\log(1/\delta)}{(1-c)^2\Delta_i^2} \right\rceil.$$

This choice of $u_i$ can be larger than $n$, but in this case Eq. (7.9) holds trivially
since $T_i(n) \le n$. Then, using the assumption that $\delta = 1/n^2$ and this choice of $u_i$
leads via (7.9) to

$$\mathbb{E}[T_i(n)] \le u_i + 1 + n^{1-2c^2/(1-c)^2} = \left\lceil \frac{2\log(n^2)}{(1-c)^2\Delta_i^2} \right\rceil + 1 + n^{1-2c^2/(1-c)^2}. \tag{7.10}$$

All that remains is to choose $c \in (0,1)$. The last term will contribute a
polynomial dependence on $n$ unless $2c^2/(1-c)^2 \ge 1$. However, if $c$ is chosen too
close to 1, then the first term blows up. Somewhat arbitrarily we choose $c = 1/2$,
which leads to

$$\mathbb{E}[T_i(n)] \le 3 + \frac{16\log(n)}{\Delta_i^2}.$$


The result follows by substituting the above display in Eq. (7.4). 
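
For readers who want the arithmetic behind the choice $c = 1/2$ spelled out, the following worked step may help. It is only a sketch of the substitution and uses nothing beyond $\delta = 1/n^2$, the definition of $u_i$, the bound (7.10) and the rounding $\lceil x \rceil \le x + 1$.

```latex
\begin{align*}
\frac{2c^2}{(1-c)^2}\bigg|_{c=1/2} &= \frac{2 \cdot 1/4}{1/4} = 2
  \quad\Longrightarrow\quad n^{1 - 2c^2/(1-c)^2} = \frac{1}{n} \le 1, \\
u_i = \left\lceil \frac{2\log(n^2)}{(1/2)^2 \Delta_i^2} \right\rceil
  &= \left\lceil \frac{16\log(n)}{\Delta_i^2} \right\rceil
  \le 1 + \frac{16\log(n)}{\Delta_i^2}, \\
\mathbb{E}[T_i(n)] &\le u_i + 1 + n^{1 - 2c^2/(1-c)^2}
  \le 3 + \frac{16\log(n)}{\Delta_i^2}.
\end{align*}
```

Multiplying the last line by $\Delta_i$ and summing over the arms gives the two terms in the statement of Theorem 7.1.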


As we saw for the ETC strategy, the regret bound in Theorem 7.1 depends 
on the reciprocal of the gaps, which may be meaningless when even a single 
suboptimal action has a very small suboptimality gap. As before, one can also 
prove a sublinear regret bound that does not depend on the reciprocal of the
gaps.


THEOREM 7.2. If $\delta = 1/n^2$, then the regret of UCB, as defined in Algorithm 3,
on any $\nu \in \mathcal{E}^k_{\mathrm{SG}}(1)$ environment, is bounded by

$$R_n \le 8\sqrt{nk\log(n)} + 3\sum_{i=1}^k \Delta_i.$$


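Proof Let $\Delta > 0$ be some value to be tuned subsequently, and recall from the
proof of Theorem 7.1 that for each suboptimal arm $i$, we can bound

$$\mathbb{E}[T_i(n)] \le 3 + \frac{16\log(n)}{\Delta_i^2}.$$

Therefore, using the basic regret decomposition again (Lemma 4.5), we have

$$\begin{aligned}
R_n = \sum_{i=1}^k \Delta_i\mathbb{E}[T_i(n)] &= \sum_{i:\Delta_i<\Delta} \Delta_i\mathbb{E}[T_i(n)] + \sum_{i:\Delta_i\ge\Delta} \Delta_i\mathbb{E}[T_i(n)] \\
&\le n\Delta + \sum_{i:\Delta_i\ge\Delta}\left(3\Delta_i + \frac{16\log(n)}{\Delta_i}\right) \le n\Delta + \frac{16k\log(n)}{\Delta} + 3\sum_{i=1}^k \Delta_i \\
&\le 8\sqrt{nk\log(n)} + 3\sum_{i=1}^k \Delta_i,
\end{aligned}$$

where the first inequality follows because $\sum_{i:\Delta_i<\Delta} T_i(n) \le n$ and the last line by
choosing $\Delta = \sqrt{16k\log(n)/n}$.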
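
The tuning step at the end of this proof is easy to check numerically. The sketch below is illustrative only and the function names are assumptions made here. It evaluates the threshold-dependent part $n\Delta + 16k\log(n)/\Delta$ of the bound and confirms that the stated choice of $\Delta$ balances the two terms, giving $8\sqrt{nk\log(n)}$.

```python
import math


def split_bound(n, k, threshold):
    """Threshold-dependent part of the bound: n*D + 16*k*log(n)/D."""
    return n * threshold + 16.0 * k * math.log(n) / threshold


def tuned_threshold(n, k):
    """The choice D = sqrt(16*k*log(n)/n) made in the last line of the proof."""
    return math.sqrt(16.0 * k * math.log(n) / n)


def worst_case_term(n, k):
    """Leading term 8*sqrt(n*k*log(n)) in Theorem 7.2."""
    return 8.0 * math.sqrt(n * k * math.log(n))


if __name__ == "__main__":
    n, k = 1000, 2
    d = tuned_threshold(n, k)
    print(d, split_bound(n, k, d), worst_case_term(n, k))
    # Any other threshold gives a value at least as large.
    print(split_bound(n, k, 0.5 * d), split_bound(n, k, 2.0 * d))
```

The tuned threshold is the minimiser of $n\Delta + 16k\log(n)/\Delta$ over $\Delta > 0$, which is why no other split of the arms into ‘small gap’ and ‘large gap’ improves the constant with this argument.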


The additive $\sum_i \Delta_i$ term is unavoidable because no reasonable algorithm can
avoid playing each arm once (try to work out what would happen if it did not).
In any case, this term does not grow with the horizon $n$ and is typically negligible.


Figure 7.1 Experiment showing universality of UCB relative to fixed instances of ETC


As it happens, Theorem 7.2 is close to optimal. We will see in Chapter 15 that
no algorithm can enjoy regret smaller than $O(\sqrt{nk})$ over all problems in $\mathcal{E}^k_{\mathrm{SG}}(1)$.
In Chapter 9 we will also see a more complicated variant of Algorithm 3 that 
shaves the logarithmic term from the upper bound given above. 


EXPERIMENT 7.1 We promised that UCB would overcome the limitations 
of ETC by achieving the same guarantees but without prior knowledge of 
the suboptimality gaps. The theory supports this claim, but just because two 
algorithms have similar theoretical guarantees does not mean they perform the 
same empirically. The theoretical analysis might be loose for one algorithm and 
maybe not the other, or by a different margin. For this reason it is always wise to 
prove lower bounds (which we do later) and compare the empirical performance, 
which we do (very briefly) now. 

The set-up is the same as in Fig. 6.1, which has $n = 1000$ and $k = 2$ and
unit variance Gaussian rewards with means 0 and $-\Delta$ respectively. The plot in
Fig. 7.1 shows the expected regret of UCB relative to ETC for a variety of choices
of commitment time $m$. The expected regret of ETC with the optimal choice of
$m$ (which depends on the knowledge of $\Delta$ and that the pay-offs are Gaussian, cf.
Fig. 6.1) is also shown.
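
The set-up just described can be simulated in a few lines. The sketch below is an illustration under stated assumptions, not the code used to produce Fig. 7.1: the function names, the number of repetitions and the choice of gaps are made up here, the ETC variant explores each arm exactly $m$ times before committing (which requires $2m \le n$), and regret is measured as the pseudo-regret averaged over runs.

```python
import math
import random


def ucb_regret(n, gap, rng):
    """Pseudo-regret of UCB(1/n^2) on Gaussian arms with means 0 and -gap."""
    means = [0.0, -gap]
    counts, est = [0, 0], [0.0, 0.0]
    log_term = 2.0 * math.log(n ** 2)  # 2 log(1/delta) with delta = 1/n^2
    regret = 0.0
    for _ in range(n):
        index = [
            math.inf if counts[i] == 0 else est[i] + math.sqrt(log_term / counts[i])
            for i in range(2)
        ]
        arm = 0 if index[0] >= index[1] else 1
        x = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        est[arm] += (x - est[arm]) / counts[arm]
        regret += gap * arm  # arm 1 is suboptimal by `gap`
    return regret


def etc_regret(n, gap, m, rng):
    """Pseudo-regret of ETC: explore each arm m times, then commit."""
    means = [0.0, -gap]
    sums = [0.0, 0.0]
    for i in range(2):
        for _ in range(m):
            sums[i] += rng.gauss(means[i], 1.0)
    committed = 0 if sums[0] >= sums[1] else 1
    return gap * m + gap * committed * (n - 2 * m)


def average(f, runs):
    """Average of `runs` independent evaluations of f()."""
    return sum(f() for _ in range(runs)) / runs


if __name__ == "__main__":
    rng = random.Random(0)
    n, runs = 1000, 100
    for gap in (0.1, 0.5, 1.0):
        ucb = average(lambda: ucb_regret(n, gap, rng), runs)
        etc = average(lambda: etc_regret(n, gap, 50, rng), runs)
        print(f"gap={gap:.1f}  UCB={ucb:.1f}  ETC(m=50)={etc:.1f}")
```

Sweeping `gap` over a finer grid and adding further values of `m` should give curves of the kind shown in the figure, up to simulation noise.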


The results demonstrate a common phenomenon. If ETC is tuned with the 
optimal choice of commitment time for each choice of $\Delta$, then it outperforms
the parameter-free UCB, though only by a relatively small margin. If,
however, the commitment time must be chosen without the knowledge of
$\Delta$, then ETC will usually not outperform UCB. As it happens, a variant of
UCB introduced in the next chapter actually outperforms even the optimally
tuned ETC.


7.2 Notes


1 The choice of $\delta = 1/n^2$ led to an easy analysis, but comes with two
disadvantages. First of all, it turns out that a slightly larger value of $\delta$
(equivalently, a slightly narrower confidence interval) improves the regret
(and empirical performance). Secondly, the dependence on
$n$ means the horizon must be known in advance, which is often not reasonable.
Both of these issues are resolved in the next chapter, where $\delta$ is chosen to be
larger and to depend on the current round $t$ rather than $n$. Nonetheless, as
promised, Algorithm 3 with $\delta = 1/n^2$ does achieve a regret bound similar to
the ETC strategy, but without requiring knowledge of the gaps.


2 The assumption that the rewards generated by each arm are independent can
be relaxed significantly. All of the results would go through by assuming there
exists a mean reward vector $\mu \in \mathbb{R}^k$ such that

$$\mathbb{E}[X_t \mid X_1, A_1, \ldots, A_{t-1}, X_{t-1}, A_t] = \mu_{A_t} \quad \text{a.s.} \tag{7.11}$$

$$\mathbb{E}[\exp(\lambda(X_t - \mu_{A_t})) \mid X_1, A_1, \ldots, A_{t-1}, X_{t-1}, A_t] \le \exp(\lambda^2/2) \quad \text{a.s.} \tag{7.12}$$

Eq. (7.11) is just saying that the conditional mean of the reward in round $t$
only depends on the chosen action. Eq. (7.12) ensures that the tails of $X_t$ are
conditionally subgaussian. That everything still goes through is proven using
martingale techniques, which we develop in detail in Chapter 20.


3 So is the optimism principle universal? Does it always lead to policies with 
strong guarantees in more complicated settings? Unfortunately the answer turns 
out to be no. The optimism principle usually leads to reasonable algorithms 
when (i) any action gives feedback about the quality of that action and (ii) no 
action gives feedback about the value of other actions. When (i) is violated, even 
sublinear regret may not be guaranteed. When (ii) is violated, an optimistic 
algorithm may avoid actions that lead to large information gain and low reward, 
even when this trade-off is optimal. An example where this occurs is provided 
in Chapter 25 on linear bandits. Optimism can work in more complex models as 
well, but sometimes fails to appropriately balance exploration and exploitation. 


4 When thinking about future outcomes, humans and some animals often have 
higher expectations than are warranted by past experience or conditions of the 
environment. This phenomenon, a form of cognitive bias, is known as the 
optimism bias in the psychology and behavioural economics literature and is 
in fact ‘one of the most consistent, prevalent, and robust biases documented in 
psychology and behavioural economics’ [Sharot, 2011a]. While much has been 
written about this bias in these fields, and one of the current explanations 
of why the optimism bias is so prevalent is that it helps exploration, to our 
best knowledge, the connection to the deeper mathematical justification of 
optimism, pursued here and in other parts of this book, has so far escaped the 
attention of researchers in all the relevant fields.


Bibliographical Remarks


The use of confidence bounds and the idea of optimism first appeared in the work 
by Lai and Robbins [1985]. They analysed the asymptotics for various parametric 
bandit problems (see the next chapter for more details on this). The first version 
of UCB is by Lai [1987]. Other early work is by Katehakis and Robbins [1995], 
who gave a very straightforward analysis for the Gaussian case, and Agrawal 
[1995], who noticed that all that was needed is an appropriate sequence of 
upper confidence bounds on the unknown means. In this way, their analysis is 
significantly more general than what we have done here. These researchers also 
focused on the asymptotics, which at the time was the standard approach in 
the statistics literature. The UCB algorithm was independently discovered by 
Kaelbling [1993], although with no regret analysis or clear advice on how to tune 
the confidence parameter. The version of UCB discussed here is most similar to
that analysed by Auer et al. [2002a] under the name UCB1, but that algorithm
used $t$ rather than $n$ in the confidence level (see the next chapter). Like us, they
prove a finite-time regret bound. However, rather than considering 1-subgaussian
environments, Auer et al. [2002a] consider bandits where the pay-offs are confined
to the $[0,1]$ interval, which are ensured to be 1/2-subgaussian. See Exercise 7.2
for hints on what must change in this situation. The basic structure of the proof
of our Theorem 7.1 is essentially the same as that of theorem 1 of Auer et al.
[2002a]. The worst-case bound in Theorem 7.2 appeared in the book by Bubeck
and Cesa-Bianchi [2012], which also popularised the subgaussian set-up. We did 
not have time to discuss the situation where the subgaussian constant is unknown. 
There have been several works exploring this direction. If the variance is unknown, 
but the noise is bounded, then one can replace the subgaussian concentration 
bounds with an empirical Bernstein inequality [Audibert et al., 2007]. For details, 
see Exercise 7.6. If the noise has heavy tails, then a more serious modification is 
required, as discussed in Exercise 7.7 and the note that follows. 

We found the article by Sharot [2011a] on optimism bias from the psychology
literature quite illuminating. Readers looking to dive deeper into this literature 
may enjoy the book by the same author [Sharot, 2011b]. Optimism bias is also 
known as ‘unrealistic optimism’, a term that is most puzzling to us — what bias 
is ever realistic? The background of this is explained by Jefferson et al. [2017]. 


Exercises 


7.1 (CONCENTRATION FOR SEQUENCES OF RANDOM LENGTH) In this exercise,
we investigate one of the more annoying challenges when analyzing sequential
algorithms. Let $X_1, X_2, \ldots$ be a sequence of independent standard Gaussian
random variables defined on probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Suppose that
$T : \Omega \to \{1, 2, 3, \ldots\}$ is another random variable, and let $\hat\mu = \sum_{t=1}^T X_t/T$ be the empirical
mean based on $T$ samples.

(a) Show that if $T$ is independent from $X_t$ for all $t$, then

$$\mathbb{P}\left(\hat\mu \ge \sqrt{\frac{2\log(1/\delta)}{T}}\right) \le \delta.$$

(b) Now relax the assumption that $T$ is independent from $(X_t)_t$. Let $E_t = \mathbb{I}\{T = t\}$
be the event that $T = t$ and $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$ be the $\sigma$-algebra
generated by the first $t$ samples. Let $\delta \in (0,1)$ and show there exists a $T$
such that for all $t \in \{1, 2, 3, \ldots\}$ it holds that $E_t$ is $\mathcal{F}_t$-measurable and

$$\mathbb{P}\left(\hat\mu \ge \sqrt{\frac{2\log(1/\delta)}{T}}\right) = 1.$$

(c) Show that

$$\mathbb{P}\left(\hat\mu \ge \sqrt{\frac{2\log(T(T+1)/\delta)}{T}}\right) \le \delta. \tag{7.13}$$


Hint For part (b) above, you may find it useful to apply the law of the iterated
logarithm, which says if $X_1, X_2, \ldots$ is a sequence of independent and identically
distributed random variables with zero mean and unit variance, then

$$\limsup_{n\to\infty} \frac{\sum_{t=1}^n X_t}{\sqrt{2n\log\log n}} = 1 \quad \text{almost surely.}$$

This result is especially remarkable because it relies on no assumptions other
than zero mean and unit variance. You might wonder if Eq. (7.13) might continue
to hold if $\log(T(T+1)/\delta)$ were replaced by $\log(\log(T)/\delta)$. It almost does, but
the proof of this fact is more sophisticated. For more details, see the paper by
Garivier [2013] or Exercise 20.9.


7.2 (RELAXING THE SUBGAUSSIAN ASSUMPTION) In this chapter, we assumed 
the pay-off distributions were 1-subgaussian. The purpose of this exercise is to 
relax this assumption. 


(a) First suppose that $\sigma^2 > 0$ is a known constant and that $\nu \in \mathcal{E}^k_{\mathrm{SG}}(\sigma^2)$. Modify
the UCB algorithm and state and prove an analogue of Theorems 7.1 and 7.2
for this case.

(b) Now suppose that $\nu = (P_i)_{i=1}^k$ is chosen so that $P_i$ is $\sigma_i$-subgaussian where
$(\sigma_i^2)_{i=1}^k$ are known. Modify the UCB algorithm and state and prove an
analogue of Theorems 7.1 and 7.2 for this case.

(c) If you did things correctly, the regret bound in the previous part should not
depend on the values of $\{\sigma_i^2 : \Delta_i = 0\}$. Explain why not.


7.3 (HIGH-PROBABILITY BOUNDS) Recall from Chapter 4 that the pseudo-regret
is defined to be the random variable

$$\bar R_n = \sum_{t=1}^n \Delta_{A_t}.$$

The UCB policy in Algorithm 3 depends on confidence parameter $\delta \in (0,1]$ that
determines the level of optimism. State and prove a bound on the pseudo-regret
of this algorithm that holds with probability $1 - f(n,k)\delta$, where $f(n,k)$ is a
function that depends on $n$ and $k$ only. More precisely, show that for any bandit
$\nu \in \mathcal{E}^k_{\mathrm{SG}}(1)$,

$$\mathbb{P}\left(\bar R_n \ge g(n, \nu, \delta)\right) \le f(n,k)\delta,$$

where $g$ and $f$ should be as small as possible (there are trade-offs; try to come
up with a natural choice).


7.4 (PHASED UCB (I)) Fix a 1-subgaussian $k$-armed bandit environment and a
horizon n. Consider the version of UCB that works in phases of exponentially 
increasing length of 1,2,4,.... In each phase, the algorithm uses the action that 
would have been chosen by UCB at the beginning of the phase (see Algorithm 4 
below). 


(a) State and prove a bound on the regret for this version of UCB. 
(b) Compare your result with Theorem 7.1. 


(c) How would the result change if the $\ell$th phase had a length of $\lceil\alpha^\ell\rceil$ with
$\alpha > 1$?


1: Input $k$ and $\delta$
2: Choose each arm once
3: for $\ell = 1, 2, \ldots$ do
4:     Compute $A_\ell = \operatorname{argmax}_i \mathrm{UCB}_i(t-1, \delta)$
5:     Choose arm $A_\ell$ exactly $2^\ell$ times
6: end for

Algorithm 4: A phased version of UCB.


7.5 (PHASED UCB (II)) Let $\alpha > 1$ and consider the version of UCB that first
plays each arm once. Thereafter it operates in the same way as UCB, but rather
than playing the chosen arm just once, it plays it until the number of plays of
that arm is a factor of $\alpha$ larger (see Algorithm 5 below).


(a) State and prove a bound on the regret for the version of UCB with $\alpha = 2$
(doubling counts).

(b) Compare with the result of the previous exercise and with Theorem 7.1.
What can you conclude?

(c) Repeat the analysis for $\alpha > 1$. What is the role of $\alpha$?

(d) Implement these algorithms and compare them empirically to UCB($\delta$).


1: Input $k$ and $\delta$
2: Choose each arm once
3: for $\ell = 1, 2, \ldots$ do
4:     Let $t_\ell = t$
5:     Compute $A_\ell = \operatorname{argmax}_i \mathrm{UCB}_i(t_\ell - 1, \delta)$
6:     Choose arm $A_\ell$ until round $t$ such that $T_i(t) \ge \alpha T_i(t_\ell - 1)$
7: end for

Algorithm 5: A phased version of UCB.


The algorithms of the last two exercises may seem ridiculous. Why would 
you wait before updating empirical estimates and choosing a new action? 
There are at least two reasons: 


(a) It can happen that the algorithm does not observe its rewards 
immediately, but rather they appear asynchronously after some delay. 
Alternatively, many bandit algorithms may be operating simultaneously
and the results must be communicated at some cost. 

(b) If the feedback model has a more complicated structure than what we 
examined so far, then even computing the upper confidence bound just 
once can be quite expensive. In these circumstances, it’s comforting to 
know that the loss of performance by updating the statistics only rarely 
is not too severe. 


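7.6 (ADAPTING TO REWARD VARIANCE IN BANDITS WITH BOUNDED REWARDS)
Let $X_1, X_2, \ldots, X_n$ be a sequence of independent and identically distributed
random variables with mean $\mu$ and variance $\sigma^2$ and bounded support so that
$X_t \in [0,b]$ almost surely. Let $\hat\mu = \sum_{t=1}^n X_t/n$ and $\hat\sigma^2 = \sum_{t=1}^n (\hat\mu - X_t)^2/n$. The
empirical Bernstein inequality says that for any $\delta \in (0,1)$,

$$\mathbb{P}\left(|\hat\mu - \mu| \ge \sqrt{\frac{2\hat\sigma^2}{n}\log\left(\frac{3}{\delta}\right)} + \frac{3b}{n}\log\left(\frac{3}{\delta}\right)\right) \le \delta.$$

(a) Show that $\hat\sigma^2 = \frac{1}{n}\sum_{t=1}^n (X_t - \mu)^2 - (\hat\mu - \mu)^2$.
(b) Show that $\mathbb{V}[(X_t - \mu)^2] \le b^2\sigma^2$.
(c) Use Bernstein’s inequality (Exercise 5.14) to show that

$$\mathbb{P}\left(\hat\sigma^2 \ge \sigma^2 + \sqrt{\frac{2b^2\sigma^2}{n}\log\left(\frac{1}{\delta}\right)} + \frac{2b^2}{3n}\log\left(\frac{1}{\delta}\right)\right) \le \delta.$$

(d) Suppose that $\nu = (\nu_i)_{i=1}^k$ is a bandit where $\mathrm{Supp}(\nu_i) \subset [0,b]$ and the variance
of the $i$th arm is $\sigma_i^2$ (with our earlier notation, $\nu \in \mathcal{E}^k_{[0,b]}$). Design a policy
that depends on $b$, but not $\sigma_i^2$ such that

$$R_n \le C\sum_{i:\Delta_i>0}\left(\Delta_i + \left(b + \frac{\sigma_i^2}{\Delta_i}\right)\log(n)\right), \tag{7.14}$$

where $C > 0$ is a universal constant.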


If you did things correctly, then the policy you derived in Exercise 7.6
should resemble UCB-V by Audibert et al. [2007]. The proof of the empirical
Bernstein inequality also appears there or (with slightly better constants) in the papers
by Mnih et al. [2008] and Maurer and Pontil [2009].


It is worth comparing (7.14) to the result of Theorem 7.1. In particular,
recall that if the rewards are bounded by $b$, the reward distributions are
$b$-subgaussian. The regret of UCB which adjusts the confidence intervals
accordingly can then be shown to be $R_n = O\left(\sum_{i:\Delta_i>0} b^2\log(n)/\Delta_i\right)$. Thus, the
main advantage of the policy of the previous exercise is the replacement of
$b^2/\Delta_i$ in this bound with $b + \sigma_i^2/\Delta_i$. In Exercise 16.7, you will show that this is
essentially unimprovable.


7.7 (MEDIAN OF MEANS AND BANDITS WITH KNOWN FINITE VARIANCE)
Let $n \in \mathbb{N}^+$ and $(A_i)_{i=1}^m$ be a partition of $[n]$ so that $\bigcup_{i=1}^m A_i = [n]$ and
$A_i \cap A_j = \emptyset$ for all $i \ne j$. Suppose that $\delta \in (0,1)$ and $X_1, X_2, \ldots, X_n$ is a
sequence of independent random variables with mean $\mu$ and variance $\sigma^2$. The
median-of-means estimator $\hat\mu_M$ of $\mu$ is the median of $\hat\mu_1, \hat\mu_2, \ldots, \hat\mu_m$, where
$\hat\mu_i = \sum_{t\in A_i} X_t/|A_i|$ is the mean of the data in the $i$th block.


(a) Show that if $m = \left\lfloor \min\left\{\frac{n}{2},\, 8\log\left(\frac{e^{1/8}}{\delta}\right)\right\} \right\rfloor$ and $A_i$ are chosen as equally
sized as possible, then

$$\mathbb{P}\left(\hat\mu_M + \sqrt{\frac{192\sigma^2}{n}\log\left(\frac{e^{1/8}}{\delta}\right)} \le \mu\right) \le \delta.$$

(b) Use the median-of-means estimator to design an upper confidence bound
algorithm such that for all $\nu \in \mathcal{E}^k_{\mathrm{V}}(\sigma^2)$,

$$R_n \le C\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{\sigma^2\log(n)}{\Delta_i}\right),$$

where $C > 0$ is a universal constant.


This exercise shows that the subgaussian assumption can be relaxed to
requiring only finite variance at the price of increased constant factors. The 
result is only possible by replacing the standard empirical estimator with 
something more robust. The median-of-means estimator is only one way to 
do this. In fact, the empirical estimator can be made robust by truncating 
the observed rewards and applying the empirical Bernstein concentration 
inequality. The disadvantage of this approach is that choosing the location 
of truncation requires prior knowledge about the approximate location of 
the mean. Another approach is Catoni’s estimator, which also exhibits 
excellent asymptotic properties [Catoni, 2012]. Yet another idea is to minimise 
the Huber loss [Sun et al., 2017]. This latter paper focuses on linear
models, but the results still apply in one dimension. The application of these
ideas to bandits was first made by Bubeck et al. [2013a], where the reader
will find more interesting results. Most notably, things can still be made
to work even if the variance does not exist. In this case, however, there is a
price to be paid in terms of the regret. The median-of-means estimator is due 
to Alon et al. [1996]. In case the variance is also unknown, then it may be 
estimated by assuming a known bound on the kurtosis, which covers many 
classes of bandits (Gaussian with arbitrary variance, exponential and many 
more), but not some simple cases (Bernoulli). The policy that results from 
this procedure has the benefit of being invariant under the transformations 
of shifting or scaling the losses [Lattimore, 2017]. 
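
To make the definition of the median-of-means estimator in Exercise 7.7 concrete, here is a minimal Python sketch. It is an illustration only: the function name is an assumption, the blocks are taken to be contiguous (the exercise only asks for sizes that are as equal as possible), and the fallback to a single block for very small samples is a convenience added here rather than part of the exercise.

```python
import math
import statistics


def median_of_means(xs, delta):
    """Median-of-means estimate of the mean of xs at confidence level delta.

    Uses m = floor(min(n/2, 8*log(e^{1/8}/delta))) blocks as in part (a).
    """
    n = len(xs)
    m = math.floor(min(n / 2, 8.0 * math.log(math.exp(1.0 / 8.0) / delta)))
    m = max(m, 1)  # assumption: fall back to the plain mean when n is tiny
    # Split the data into m contiguous blocks whose sizes differ by at most one.
    base, extra = divmod(n, m)
    block_means, start = [], 0
    for j in range(m):
        size = base + (1 if j < extra else 0)
        block = xs[start:start + size]
        block_means.append(sum(block) / size)
        start += size
    # With an even number of blocks, statistics.median averages the two
    # middle block means.
    return statistics.median(block_means)


if __name__ == "__main__":
    data = [0.0] * 9 + [1000.0]  # one gross outlier
    print(sum(data) / len(data), median_of_means(data, delta=0.5))
```

A single extreme observation moves the empirical mean a long way but corrupts only one block mean, and the median ignores it. This is the robustness that the note above refers to.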


7.8 (EMPIRICAL COMPARISON)


(a) Implement Algorithm 3.

(b) Reproduce Fig. 7.1.

(c) Explain the shape of the curves for ETC. In particular, when $m = 50$, we
see a bump, a dip and then a linear asymptote as $\Delta$ grows. Why does the
curve look like this?

(d) Design an experiment to determine the practical effect of the choice of $\delta$.

(e) Explain your results.


The Upper Confidence Bound Algorithm: Asymptotic Optimality


The algorithm analysed in the previous chapter is not anytime: it needs to know
the horizon $n$ in advance. This shortcoming
is resolved via a slight modification and a refinement of the analysis. The improved
analysis leads to constant factors in the dominant logarithmic term that match a
lower bound provided later in Chapter 16.


8.1 Asymptotically Optimal UCB


The algorithm studied is shown in Algorithm 6. It differs from the one analysed 
in the previous section (Algorithm 3) only by the choice of the confidence level, 
the choice of which is dictated by the analysis of its regret. 


1: Input $k$
2: Choose each arm once
3: Subsequently choose

$$A_t = \operatorname{argmax}_i \left(\hat\mu_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}}\right),$$

where $f(t) = 1 + t\log^2(t)$

Algorithm 6: Asymptotically optimal UCB.
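
A minimal Python sketch of Algorithm 6 follows. The names `f`, `run_asymptotic_ucb` and the reward callback `sample_reward` are illustrative assumptions, and ties are broken in favour of the lowest arm number. Note that neither a horizon nor a confidence level is passed to the index, which depends on the current round $t$ only through $f(t)$.

```python
import math
import random


def f(t):
    """f(t) = 1 + t * log^2(t), the function used in Algorithm 6."""
    return 1.0 + t * math.log(t) ** 2


def run_asymptotic_ucb(sample_reward, k, n):
    """Play n rounds of Algorithm 6 and return the play counts.

    sample_reward(i) is a hypothetical callback returning a reward from arm i.
    """
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1  # choose each arm once
        else:
            bonus = 2.0 * math.log(f(t))
            arm = max(
                range(k),
                key=lambda i: means[i] + math.sqrt(bonus / counts[i]),
            )
        x = sample_reward(arm)
        counts[arm] += 1
        means[arm] += (x - means[arm]) / counts[arm]  # running mean
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    true_means = [0.0, -0.5]  # illustrative instance
    print(run_asymptotic_ucb(lambda i: rng.gauss(true_means[i], 1.0), 2, 1000))
```

The loop length `n` here only decides when to stop the simulation; stopping earlier or later would not change any of the decisions made up to that point.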


The regret bound for Algorithm 6 is more complicated than the bound for 
Algorithm 3 (see Theorem 7.1). The dominant terms in the two results have the 
same order, but the gain here is that in this result the leading constant, governing 
the asymptotic rate of growth of regret, is smaller. 


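THEOREM 8.1. For any 1-subgaussian bandit, the regret of Algorithm 6 satisfies

$$R_n \le \sum_{i:\Delta_i>0} \inf_{\varepsilon\in(0,\Delta_i)} \Delta_i\left(1 + \frac{5}{\varepsilon^2} + \frac{2\left(\log f(n) + \sqrt{\pi\log f(n)} + 1\right)}{(\Delta_i - \varepsilon)^2}\right). \tag{8.1}$$

Furthermore,

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}. \tag{8.2}$$

Choosing $\varepsilon = \Delta_i/2$ inside the sum shows that

$$R_n \le \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{1}{\Delta_i}\left(8\log f(n) + 8\sqrt{\pi\log f(n)} + 28\right)\right). \tag{8.3}$$

Even more concretely, there exists some universal constant $C > 0$ such that

$$R_n \le C\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{\log(n)}{\Delta_i}\right),$$

which by the same argument as in the proof of Theorem 7.2 leads to a worst-case
bound of $R_n \le C\sum_{i=1}^k \Delta_i + 2\sqrt{Cnk\log(n)}$.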


Taking the limit of the ratio of the bound in (8.3) and log(n) does not result 
in the same constant as in the theorem, which is the main justification for 
introducing the more complicated regret bound. You will see in Chapter 15 
that the asymptotic bound on the regret given in (8.2) is unimprovable in a 
strong sense. 


We start with a useful lemma to bound the number of times the index of a 
suboptimal arm will be larger than some threshold above its mean. 


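LEMMA 8.2. Let $X_1, \ldots, X_n$ be a sequence of independent 1-subgaussian random
variables, $\hat\mu_t = \frac{1}{t}\sum_{s=1}^t X_s$, $\varepsilon > 0$, $a > 0$ and

$$\kappa = \sum_{t=1}^n \mathbb{I}\left\{\hat\mu_t + \sqrt{\frac{2a}{t}} \ge \varepsilon\right\}, \qquad \kappa' = u + \sum_{t=\lceil u\rceil}^n \mathbb{I}\left\{\hat\mu_t + \sqrt{\frac{2a}{t}} \ge \varepsilon\right\},$$

where $u = 2a\varepsilon^{-2}$. Then it holds that $\mathbb{E}[\kappa] \le \mathbb{E}[\kappa'] \le 1 + \frac{2}{\varepsilon^2}\left(a + \sqrt{\pi a} + 1\right)$.


The intuition for this result is as follows. Since the $X_i$ are 1-subgaussian and
independent we have $\mathbb{E}[\hat\mu_t] = 0$, so we cannot expect $\hat\mu_t + \sqrt{2a/t}$ to be smaller
than $\varepsilon$ until $t$ is at least $2a/\varepsilon^2$. The lemma confirms that this is the right order
as an estimate for $\mathbb{E}[\kappa]$.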


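Proof By Corollary 5.5 we have

$$\begin{aligned}
\mathbb{E}[\kappa] \le \mathbb{E}[\kappa'] &= u + \sum_{t=\lceil u\rceil}^n \mathbb{P}\left(\hat\mu_t + \sqrt{\frac{2a}{t}} \ge \varepsilon\right) \le u + \sum_{t=\lceil u\rceil}^n \exp\left(-\frac{t\left(\varepsilon - \sqrt{2a/t}\right)^2}{2}\right) \\
&\le 1 + u + \int_u^\infty \exp\left(-\frac{t\left(\varepsilon - \sqrt{2a/t}\right)^2}{2}\right) dt = 1 + \frac{2}{\varepsilon^2}\left(a + \sqrt{\pi a} + 1\right),
\end{aligned}$$

where the final equality follows by making the substitution $s = \varepsilon\sqrt{t} - \sqrt{2a}$ and
substituting the value of $u$ from the lemma statement.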


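Proof of Theorem 8.1 As usual, the starting point is the fundamental regret
decomposition (Lemma 4.5),

$$R_n = \sum_{i:\Delta_i>0} \Delta_i\mathbb{E}[T_i(n)].$$

The rest of the proof revolves around bounding $\mathbb{E}[T_i(n)]$. Let $i$ be a suboptimal
arm. The main idea is to decompose $T_i(n)$ into two terms. The first measures the
number of times the index of the optimal arm is less than $\mu_1 - \varepsilon$. The second term
measures the number of times that $A_t = i$ and its index is larger than $\mu_1 - \varepsilon$.

$$\begin{aligned}
T_i(n) = \sum_{t=1}^n \mathbb{I}\{A_t = i\} &\le \sum_{t=1}^n \mathbb{I}\left\{\hat\mu_1(t-1) + \sqrt{\frac{2\log f(t)}{T_1(t-1)}} \le \mu_1 - \varepsilon\right\} \\
&\quad + \sum_{t=1}^n \mathbb{I}\left\{\hat\mu_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}} \ge \mu_1 - \varepsilon \text{ and } A_t = i\right\}.
\end{aligned} \tag{8.4}$$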


The proof of the first part of the theorem is completed by bounding the expectation 
of each of these two sums. Starting with the first, we again use Corollary 5.5: 


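$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\left\{\hat\mu_1(t-1) + \sqrt{\frac{2\log f(t)}{T_1(t-1)}} \le \mu_1 - \varepsilon\right\}\right]
&\le \sum_{t=1}^n \sum_{s=1}^n \mathbb{P}\left(\hat\mu_{1s} + \sqrt{\frac{2\log f(t)}{s}} \le \mu_1 - \varepsilon\right) \\
&\le \sum_{t=1}^n \sum_{s=1}^n \exp\left(-\frac{s\left(\sqrt{2\log f(t)/s} + \varepsilon\right)^2}{2}\right) \\
&\le \sum_{t=1}^n \frac{1}{f(t)} \sum_{s=1}^n \exp\left(-\frac{s\varepsilon^2}{2}\right) \le \frac{5}{\varepsilon^2}.
\end{aligned}$$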


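The first inequality follows from the union bound over all possible values of
$T_1(t-1)$. The last inequality is an algebraic exercise (Exercise 8.1). The function
$f(t)$ was chosen precisely so this bound would hold. For the second term in (8.4)
we use Lemma 8.2 to get

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\left\{\hat\mu_i(t-1) + \sqrt{\frac{2\log f(t)}{T_i(t-1)}} \ge \mu_1 - \varepsilon \text{ and } A_t = i\right\}\right]
&\le \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\left\{\hat\mu_i(t-1) + \sqrt{\frac{2\log f(n)}{T_i(t-1)}} \ge \mu_1 - \varepsilon \text{ and } A_t = i\right\}\right] \\
&\le \mathbb{E}\left[\sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} + \sqrt{\frac{2\log f(n)}{s}} \ge \mu_1 - \varepsilon\right\}\right] \\
&= \mathbb{E}\left[\sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} - \mu_i + \sqrt{\frac{2\log f(n)}{s}} \ge \Delta_i - \varepsilon\right\}\right] \\
&\le 1 + \frac{2}{(\Delta_i - \varepsilon)^2}\left(\log f(n) + \sqrt{\pi\log f(n)} + 1\right).
\end{aligned}$$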


The first part of the theorem follows by substituting the results of the previous
two displays into (8.4). The second part follows by choosing $\varepsilon = \log^{-1/4}(n)$ and
taking the limit as $n$ tends to infinity.
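The limiting step can be checked numerically. The sketch below is only an illustration, not part of the proof: it evaluates the per-arm bound obtained by combining the two displays above with (8.4), using $\varepsilon = \log^{-1/4}(n)$, and divides by $\log(n)$. The function names are hypothetical. The ratio approaches $2/\Delta_i$ very slowly, which is one reason the lower-order terms matter in practice.

```python
import math


def f(t: float) -> float:
    """Confidence-level function f(t) = 1 + t log^2(t)."""
    return 1.0 + t * math.log(t) ** 2


def per_arm_bound(gap: float, n: float, eps: float) -> float:
    """Gap times the bound on E[T_i(n)] from the two displays and (8.4)."""
    assert 0.0 < eps < gap
    log_f = math.log(f(n))
    plays = (1.0 + 5.0 / eps**2
             + 2.0 * (log_f + math.sqrt(math.pi * log_f) + 1.0) / (gap - eps) ** 2)
    return gap * plays


def ratio(gap: float, n: float) -> float:
    """Per-arm bound divided by log(n), with eps = log^{-1/4}(n)."""
    eps = math.log(n) ** -0.25
    return per_arm_bound(gap, n, eps) / math.log(n)


# Horizons are floats so that astronomically large n can be explored.
for n in (1e20, 1e100, 1e300):
    print(f"n = {n:.0e}: bound / log(n) = {ratio(1.0, n):.3f} (limit is 2.0)")
```

Even at $n = 10^{300}$ the ratio is still well above its limit for $\Delta_i = 1$.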


Notes 


1 The improvement to the constants comes from making the confidence interval
slightly smaller, which is made possible by a more careful analysis. The main
trick is the observation that we do not need to show that $\hat\mu_{1s} \ge \mu_1$ for all $s$
with high probability, but instead that $\hat\mu_{1s} \ge \mu_1 - \varepsilon$ for small $\varepsilon$.

2 The choice of $f(t) = 1 + t\log^2(t)$ looks quite odd. With a slightly messier
calculation we could have chosen $f(t) = t\log^\alpha(t)$ for any $\alpha > 0$. If the rewards
are actually Gaussian, then a more careful concentration analysis allows one
to choose $f(t) = t$ or even some slightly slower-growing function [Katehakis
and Robbins, 1995, Lattimore, 2016a, Garivier et al., 2016b].

3 The asymptotic regret is often indicative of finite-time performance. The reader
is advised to be cautious, however. The lower-order terms obscured by the
asymptotics can be dominant in all practical regimes.



Bibliographic Remarks 


Lai and Robbins [1985] designed policies for which Eq. (8.2) holds. They also 
proved a lower bound showing that no ‘reasonable’ policy can improve on this 
bound for any problem, where ‘reasonable’ means that they suffer subpolynomial 
regret on all problems (see Part IV). The policy proposed by Lai and Robbins 
[1985] was based on upper confidence bounds, but was not a variant of UCB. The 
asymptotics for variants of the policy presented here were given first by Lai [1987], 
Katehakis and Robbins [1995] and Agrawal [1995]. None of these articles gave
finite-time bounds like those presented here. When the reward distributions
lie in an exponential family, asymptotic and finite-time bounds with the
same flavour as those presented here are given by Cappé et al. [2013]. There is
now a huge variety of asymptotically optimal policies in a wide range of settings.
Burnetas and Katehakis [1996] study the general case and give conditions for 
a version of UCB to be asymptotically optimal. Honda and Takemura [2010, 
2011] analyse an algorithm called DMED, proving asymptotic optimality for noise 
models where the support is bounded or semi-bounded. Kaufmann et al. [2012b] 
prove asymptotic optimality for Thompson sampling (see Chapter 36) when 
the rewards are Bernoulli, which is generalised to single-parameter exponential 
families by Korda et al. [2013]. Kaufmann [2018] proves asymptotic optimality 
for the BayesUCB class of algorithms for single-parameter exponential families. 
Ménard and Garivier [2017] prove asymptotic optimality and minimax optimality 
for exponential families (more discussion in Chapter 9). 


Exercises 


8.1 Do the algebra needed at the end of the proof of Theorem 8.1. Precisely,
show that

$$\sum_{t=1}^n \frac{1}{f(t)} \sum_{s=1}^n \exp\left(-\frac{s\varepsilon^2}{2}\right) \le \frac{5}{\varepsilon^2},$$

where $f(t) = 1 + t\log^2(t)$.

HINT First bound $F = \sum_{s=1}^n \exp(-s\varepsilon^2/2)$ using a geometric series. Then show
that $\exp(-a)/(1 - \exp(-a)) \le 1/a$ holds for any $a > 0$ and conclude that $F \le 2/\varepsilon^2$.
Finish by bounding $\sum_{t=1}^n 1/f(t)$ using the fact that $1/f(t) \le 1/(t\log(t)^2)$ and
bounding a sum by an integral.
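Before attempting the algebra, it can help to see the inequality hold numerically. The following sketch evaluates the double sum directly for a few values of $\varepsilon$; the function names and the horizon $n = 10{,}000$ are choices made here for illustration. It is a sanity check, not a proof.

```python
import math


def f(t: int) -> float:
    """f(t) = 1 + t log^2(t), as in the exercise."""
    return 1.0 + t * math.log(t) ** 2


def double_sum(n: int, eps: float) -> float:
    """Left-hand side of Exercise 8.1; it factorises into two single sums."""
    sum_inv_f = sum(1.0 / f(t) for t in range(1, n + 1))
    geometric = sum(math.exp(-s * eps**2 / 2.0) for s in range(1, n + 1))
    return sum_inv_f * geometric


for eps in (0.1, 0.5, 1.0):
    lhs = double_sum(10_000, eps)
    print(f"eps = {eps}: double sum = {lhs:.2f}, bound 5/eps^2 = {5 / eps**2:.2f}")
```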


8.2 (ONE-ARMED BANDITS) Consider the one-armed bandit problem: $\mathcal{E} =
\{\mathcal{N}(\mu_1, 1) : \mu_1 \in \mathbb{R}\} \times \{\mathcal{N}(0, 1)\}$. Suppose that $\nu = (P_1, P_2) \in \mathcal{E}$ and $P_1$ has mean
$\mu_1 = 1$. Evaluate

$$\limsup_{n\to\infty} \frac{R_n(\pi, \nu)}{\log(n)},$$

where $\pi$ is the policy of Algorithm 6.


8.3 (ONE-ARMED BANDITS (II)) Consider the setting of Exercise 8.2 and define
a policy by

$$A_t = \begin{cases} 1 & \text{if } \hat\mu_1(t-1) + \sqrt{\dfrac{2\log f(t)}{T_1(t-1)}} \ge 0 \\ 2 & \text{otherwise.} \end{cases} \tag{8.5}$$

Suppose that $\nu = (P_1, P_2)$ where $P_1 = \mathcal{N}(\mu_1, 1)$ and $P_2 = \mathcal{N}(0, 1)$. Prove that
for the modified policy,


lim sup Rr(v) 


0  ifm>0 


HINT Follow the analysis for UCB, but carefully adapt the proof by using the 
fact that the index of the second arm is always zero. 


The strategy proposed in the above exercise is based on the idea that 
optimism is used to overcome uncertainty in the estimates of the quality of 
an arm, but for one-armed bandits the mean of the second arm is known in 
advance. 


8.4 (ONE-ARMED BANDITS (III)) The purpose of this question is to compare
UCB and the modified version in (8.5).


(a) Implement a simulator for the one-armed bandit problem and two algorithms:
UCB and the modified version analysed in Exercise 8.3.

(b) Use your simulator to estimate the expected regret of each algorithm for a
horizon of $n = 1000$ and $\mu_1 \in [-1, 1]$.

(c) Plot your results with $\mu_1$ on the x-axis and the estimated expected regret on
the y-axis. Don’t forget to label the axes and include error bars and a legend.

(d) Explain the results. Why do the curves look the way they do?

(e) In your plot, for what values of $\mu_1$ does the worst-case expected regret
for each algorithm occur? What is the worst-case expected regret for each
algorithm?


8.5 (DIFFERENT SUBGAUSSIAN CONSTANTS) Let $\sigma^2 \in [0,\infty)^k$ be known and
suppose that the reward is $X_t \sim \mathcal{N}(\mu_{A_t}, \sigma^2_{A_t})$. Design an algorithm (that depends
on $\sigma^2$) for which the asymptotic regret is

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} = \sum_{i:\Delta_i>0} \frac{2\sigma_i^2}{\Delta_i}.$$


9 The Upper Confidence Bound Algorithm: Minimax Optimality


We proved that the variants of UCB analysed in the last two chapters have a
worst-case regret of $R_n = O(\sqrt{kn\log(n)})$. Further, in Exercise 6.8 you showed
that an elimination algorithm achieves $R_n = O(\sqrt{kn\log(k)})$. By modifying the
confidence levels of the algorithm it is possible to remove the log factor entirely.
Building on UCB, the directly named ‘minimax optimal strategy in the stochastic 
case’ (MOSS) algorithm was the first to make this modification and is presented 
below. MOSS again depends on prior knowledge of the horizon, a requirement 
that may be relaxed, as we explain in the notes. 


The term minimax is used because, except for constant factors, the worst- 
case bound proven in this chapter cannot be improved on by any algorithm. 
The lower bounds are deferred to Part IV. 


The MOSS Algorithm 


Algorithm 7 shows the pseudocode of MOSS, which is again an instance of the 
UCB family. The main novelty is that the confidence level is chosen based on the 
number of plays of the individual arms, as well as n and k. 


1: Input $n$ and $k$
2: Choose each arm once
3: Subsequently choose

$$A_t = \operatorname{argmax}_i \; \hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)}\log^+\left(\frac{n}{kT_i(t-1)}\right)},$$

where $\log^+(x) = \log\max\{1, x\}$.


Algorithm 7: MOSS. 
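The pseudocode translates almost line by line into code. The following is a minimal sketch, assuming unit-variance Gaussian rewards and breaking ties by the lowest arm index; the names `moss_index` and `run_moss` and the seeded generator are illustrative choices rather than anything prescribed by the algorithm.

```python
import math
import random


def log_plus(x: float) -> float:
    """log^+(x) = log(max(1, x))."""
    return math.log(max(1.0, x))


def moss_index(mean: float, pulls: int, n: int, k: int) -> float:
    """Empirical mean plus the MOSS exploration bonus."""
    return mean + math.sqrt(4.0 / pulls * log_plus(n / (k * pulls)))


def run_moss(means, n, seed=0):
    """Run MOSS on a unit-variance Gaussian bandit and return the pull counts."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    totals = [0.0] * k
    for t in range(n):
        if t < k:
            arm = t  # choose each arm once
        else:
            arm = max(range(k),
                      key=lambda i: moss_index(totals[i] / pulls[i], pulls[i], n, k))
        reward = rng.gauss(means[arm], 1.0)
        pulls[arm] += 1
        totals[arm] += reward
    return pulls


print(run_moss([0.9, 0.1, 0.1], n=3000))
```

Note that the bonus vanishes once an arm has been played at least $n/k$ times, which is where MOSS differs most visibly from UCB.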
THEOREM 9.1. For any 1-subgaussian bandit, the regret of Algorithm 7 satisfies

$$R_n \le 39\sqrt{kn} + \sum_{i=1}^k \Delta_i.$$


Before the proof we state and prove a strengthened version of Corollary 5.5. 


THEOREM 9.2. Let $X_1, X_2, \ldots, X_n$ be a sequence of independent $\sigma$-subgaussian
random variables and $S_t = \sum_{s=1}^t X_s$. Then, for any $\varepsilon > 0$,

$$\mathbb{P}\left(\text{exists } t \le n : S_t \ge \varepsilon\right) \le \exp\left(-\frac{\varepsilon^2}{2n\sigma^2}\right). \tag{9.1}$$


The bound in Eq. (9.1) is the same as the bound on $\mathbb{P}(S_n \ge \varepsilon)$ that appears
in a simple reformulation of Corollary 5.5, so this new result is strictly stronger.


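Proof From the definition of subgaussian random variables and Lemma 5.4,

$$\mathbb{E}\left[\exp(\lambda S_n)\right] \le \exp\left(\frac{n\sigma^2\lambda^2}{2}\right).$$

Then, choosing $\lambda = \varepsilon/(n\sigma^2)$ leads to

$$\begin{aligned}
\mathbb{P}\left(\text{exists } t \le n : S_t \ge \varepsilon\right) &= \mathbb{P}\left(\max_{t\le n} \exp(\lambda S_t) \ge \exp(\lambda\varepsilon)\right) \\
&\le \frac{\mathbb{E}\left[\exp(\lambda S_n)\right]}{\exp(\lambda\varepsilon)} \le \exp\left(\frac{n\sigma^2\lambda^2}{2} - \lambda\varepsilon\right) = \exp\left(-\frac{\varepsilon^2}{2n\sigma^2}\right).
\end{aligned}$$

The novel step is the first inequality, which follows from Doob’s submartingale
inequality (Theorem 3.10) and the fact that $\exp(\lambda S_t)$ is a submartingale
with respect to the filtration generated by $X_1, X_2, \ldots, X_n$ (Exercise 9.1).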


Before the proof of Theorem 9.1, we need one more lemma to bound the 
probability that the index of the optimal arm ever drops too far below the actual 
mean of the optimal arm. The proof of this lemma relies on a tool called the 
peeling device, which is an important technique in probability theory and has 
many applications beyond bandits. For example, it can be used to prove the 
celebrated law of the iterated logarithm. 


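LEMMA 9.3. Let $\delta \in (0,1)$ and $X_1, X_2, \ldots$ be independent and 1-subgaussian and
$\hat\mu_t = \frac{1}{t}\sum_{s=1}^t X_s$. Then, for any $\Delta > 0$,

$$\mathbb{P}\left(\text{exists } s \ge 1 : \hat\mu_s + \sqrt{\frac{4}{s}\log^+\left(\frac{1}{s\delta}\right)} + \Delta < 0\right) \le \frac{15\delta}{\Delta^2}.$$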
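Proof Let $S_t = t\hat\mu_t$. Then

$$\begin{aligned}
&\mathbb{P}\left(\text{exists } s \ge 1 : \hat\mu_s + \sqrt{\frac{4}{s}\log^+\left(\frac{1}{s\delta}\right)} + \Delta < 0\right) \\
&\quad = \mathbb{P}\left(\text{exists } s \ge 1 : S_s + \sqrt{4s\log^+\left(\frac{1}{s\delta}\right)} + s\Delta < 0\right) \\
&\quad \le \sum_{j=0}^\infty \mathbb{P}\left(\text{exists } s \in [2^j, 2^{j+1}] : S_s + \sqrt{4s\log^+\left(\frac{1}{s\delta}\right)} + s\Delta < 0\right) \\
&\quad \le \sum_{j=0}^\infty \mathbb{P}\left(\text{exists } s \le 2^{j+1} : S_s + \sqrt{4\cdot 2^j\log^+\left(\frac{1}{2^{j+1}\delta}\right)} + 2^j\Delta < 0\right) \\
&\quad \le \sum_{j=0}^\infty \exp\left(-\frac{\left(\sqrt{2^{j+2}\log^+\left(\frac{1}{2^{j+1}\delta}\right)} + 2^j\Delta\right)^2}{2^{j+2}}\right).
\end{aligned}$$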


The first inequality follows from a union bound over a geometric grid. The second 
step is straightforward but important because it sets up to apply Theorem 9.2. 
The rest is purely algebraic: 


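$$\begin{aligned}
\sum_{j=0}^\infty \exp\left(-\frac{\left(\sqrt{2^{j+2}\log^+\left(\frac{1}{2^{j+1}\delta}\right)} + 2^j\Delta\right)^2}{2^{j+2}}\right)
&\le \delta\sum_{j=0}^\infty 2^{j+1}\exp\left(-\Delta^2 2^{j-2}\right) \\
&\le \frac{8\delta}{e\Delta^2} + \delta\int_0^\infty 2^{s+1}\exp\left(-\Delta^2 2^{s-2}\right) ds \le \frac{15\delta}{\Delta^2}.
\end{aligned}$$

Above, the first inequality follows since $(a+b)^2 \ge a^2 + b^2$ for $a, b \ge 0$, and
the second last step follows by noting that the integrand is unimodal and
has a maximum value of $8\delta/(e\Delta^2)$. For such functions $f$, one has the bound

$$\sum_{j=a}^b f(j) \le \max_{s\in[a,b]} f(s) + \int_a^b f(s)\,ds.$$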
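The final algebraic step is easy to probe numerically. The sketch below sums the series that appears after the first inequality (with the factor $\delta$ divided out) and compares it with $15/\Delta^2$ for a few gaps. The truncation at 200 terms is an assumption made here for illustration; the omitted terms underflow to zero for the gaps tested.

```python
import math


def peeling_sum(gap: float, terms: int = 200) -> float:
    """Sum over j >= 0 of 2^(j+1) * exp(-gap^2 * 2^(j-2)), truncated after `terms` terms."""
    return sum(2.0 ** (j + 1) * math.exp(-(gap**2) * 2.0 ** (j - 2))
               for j in range(terms))


for gap in (0.01, 0.1, 1.0, 3.0):
    print(f"gap = {gap}: series = {peeling_sum(gap):.4g}, bound 15/gap^2 = {15 / gap**2:.4g}")
```

For small gaps the series is close to $(8/\log 2)/\Delta^2 \approx 11.5/\Delta^2$, so the constant 15 leaves only modest slack.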


Proof of Theorem 9.1 As usual, we assume without loss of generality that the
first arm is optimal, so $\mu_1 = \mu^*$. Arguing that the optimal arm is sufficiently
optimistic with high probability is no longer satisfactory because in this refined
analysis, the probability that an arm is played linearly often needs to depend
on its suboptimality gap. A way around this difficulty is to make an argument
in terms of the expected amount of optimism. Define a random variable $\Delta$ that
measures how far the index of the optimal arm drops below its true mean.

$$\Delta = \left(\mu_1 - \min_{s\le n}\left(\hat\mu_{1s} + \sqrt{\frac{4}{s}\log^+\left(\frac{n}{ks}\right)}\right)\right)^+.$$


Arms with suboptimality gaps much larger than $\Delta$ will not be played too often,
while arms with suboptimality gaps smaller than $\Delta$ may be played linearly often,
but $\Delta$ is sufficiently small in expectation that this price is small. Using the basic
regret decomposition (Lemma 4.5) and splitting the actions based on whether or
not their suboptimality gap is smaller or larger than $2\Delta$ leads to

$$\begin{aligned}
R_n = \sum_{i:\Delta_i>0} \Delta_i\mathbb{E}[T_i(n)]
&\le \mathbb{E}\left[2n\Delta + \sum_{i:\Delta_i>2\Delta} \Delta_i T_i(n)\right] \\
&\le \mathbb{E}\left[2n\Delta + 8\sqrt{kn} + \sum_{i:\Delta_i>\max\{2\Delta,\, 8\sqrt{k/n}\}} \Delta_i T_i(n)\right].
\end{aligned}$$


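The first term is easily bounded using Proposition 2.8 and Lemma 9.3:

$$\mathbb{E}[2n\Delta] = 2n\mathbb{E}[\Delta] = 2n\int_0^\infty \mathbb{P}(\Delta \ge x)\,dx \le 2n\int_0^\infty \min\left\{1, \frac{15k}{nx^2}\right\} dx \le 16\sqrt{kn}.$$


For suboptimal arm $i$, define

$$\kappa_i = \sum_{s=1}^n \mathbb{I}\left\{\hat\mu_{is} + \sqrt{\frac{4}{s}\log^+\left(\frac{n}{ks}\right)} \ge \mu_i + \frac{\Delta_i}{2}\right\}.$$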


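The reason for choosing $\kappa_i$ in this way is that for arms $i$ with $\Delta_i > 2\Delta$, it holds
that the index of the optimal arm is always larger than $\mu_i + \Delta_i/2$, so $\kappa_i$ is an
upper bound on the number of times arm $i$ is played, $T_i(n)$. If $\Delta_i \ge 8(k/n)^{1/2}$,
then the expectation of $\Delta_i\kappa_i$ is bounded using Lemma 8.2 by

$$\begin{aligned}
\Delta_i\mathbb{E}[\kappa_i] &\le \frac{1}{\Delta_i} + \Delta_i\sum_{s=1}^n \mathbb{P}\left(\hat\mu_{is} - \mu_i + \sqrt{\frac{4}{s}\log^+\left(\frac{n\Delta_i^2}{k}\right)} \ge \frac{\Delta_i}{2}\right) \\
&\le \frac{1}{\Delta_i} + \Delta_i + \frac{8}{\Delta_i}\left(2\log^+\left(\frac{n\Delta_i^2}{k}\right) + \sqrt{2\pi\log^+\left(\frac{n\Delta_i^2}{k}\right)} + 1\right) \\
&\le \left(\frac{1}{8} + 4\log 8 + 2\sqrt{\pi\log 8} + 1\right)\sqrt{\frac{n}{k}} + \Delta_i \\
&\le 15\sqrt{\frac{n}{k}} + \Delta_i,
\end{aligned}$$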


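where the first inequality follows by replacing the $s$ in the logarithm with $1/\Delta_i^2$
and adding the $\Delta_i \times 1/\Delta_i^2$ correction term to compensate for the first $\Delta_i^{-2}$
rounds where this fails to hold. Then we use Lemma 8.2 and the monotonicity of
$x \mapsto x^{-1-p}\log^+(ax^2)$ for $p \in [0,1]$, positive $a$ and $x \ge e/\sqrt{a}$. The last inequality
follows by naively bounding $1/8 + 4\log 8 + 2\sqrt{\pi\log 8} + 1 \le 15$. Then

$$\begin{aligned}
\mathbb{E}\left[\sum_{i:\Delta_i>\max\{2\Delta,\, 8\sqrt{k/n}\}} \Delta_i T_i(n)\right]
&\le \mathbb{E}\left[\sum_{i:\Delta_i>8\sqrt{k/n}} \Delta_i\kappa_i\right] \\
&\le \sum_{i:\Delta_i>8\sqrt{k/n}} \left(\Delta_i + 15\sqrt{\frac{n}{k}}\right) \\
&\le 15\sqrt{nk} + \sum_{i=1}^k \Delta_i.
\end{aligned}$$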


Combining all the results we have $R_n \le 39\sqrt{kn} + \sum_{i=1}^k \Delta_i$.


Two Problems 


MOSS is not the ultimate algorithm. Here we highlight two drawbacks. 


Suboptimality Relative to UCB 

Although MOSS is nearly asymptotically optimal (Note 1), all versions of MOSS
can be arbitrarily worse than UCB in some regimes. This unpleasantness is hidden
by both the minimax and asymptotic optimality criteria, which highlights the
importance of fully finite-time upper and lower bounds. The counter-example
witnessing the failure is quite simple. Let the rewards for all arms be Gaussian
with unit variance and $n = k^3$, $\mu_1 = 0$, $\mu_2 = -\sqrt{k/n}$ and $\mu_i = -1$ for all $i > 2$.
From Theorem 8.1, we have that

$$R_n^{\text{UCB}} = O(k\log k),$$

while it turns out that MOSS has a regret of

$$R_n^{\text{MOSS}} = \Omega(\sqrt{kn}) = \Omega(k^2).$$

A rigorous proof of this claim is quite delicate, but we encourage readers to try
to understand why it holds intuitively.


Instability 

There is a hidden cost of pushing too hard to reduce the expected regret, which 
is that the distribution of the regret is less well-behaved. Consider a two-armed 
Gaussian bandit with suboptimality gap $\Delta$. The random (pseudo) regret is
$\hat R_n = \sum_{t=1}^n \Delta_{A_t}$, which for a carefully tuned algorithm has a roughly bimodal
distribution:

$$\hat R_n \approx \begin{cases} n\Delta & \text{with probability } \delta \\ \dfrac{1}{\Delta}\log\left(\dfrac{1}{\delta}\right) & \text{otherwise,} \end{cases}$$

where $\delta$ is a parameter of the policy that determines the likelihood that the
optimal arm is misidentified. Integrating, one has

$$R_n = \mathbb{E}[\hat R_n] = O\left(n\Delta\delta + \frac{1}{\Delta}\log\left(\frac{1}{\delta}\right)\right).$$

The choice of $\delta$ that minimises the expected regret depends on $\Delta$ and is
approximately $1/(n\Delta^2)$. With this choice, the regret is

$$R_n = O\left(\frac{1}{\Delta}\left(1 + \log\left(n\Delta^2\right)\right)\right).$$

Of course $\Delta$ is not known in advance, but it can be estimated online so that the
above bound is actually realisable by an adaptive policy that does not know $\Delta$
in advance (Exercise 9.3). Let $F$ be the (informal) event that $\hat R_n = \Omega(n\Delta)$. The
problem is that when $\delta = 1/(n\Delta^2)$ is chosen to minimise the expected regret,
then the second moment due to failure is

$$\mathbb{E}\left[\mathbb{I}\{F\}\hat R_n^2\right] = \Omega(n).$$

On the other hand, by choosing $\delta = (n\Delta)^{-2}$, the regret increases only slightly to

$$R_n = O\left(\frac{1}{\Delta}\left(1 + \log\left(n^2\Delta^2\right)\right)\right).$$

The second moment of the regret due to failure, however, is $\mathbb{E}[\mathbb{I}\{F\}\hat R_n^2] = O(1)$.
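The trade-off can be made concrete with a toy calculation. The sketch below takes the bimodal approximation literally (an assumption made here: regret is exactly $n\Delta$ with probability $\delta$ and $\log(1/\delta)/\Delta$ otherwise, with all constants set to one) and compares the two choices of $\delta$ discussed above. The function names are hypothetical.

```python
import math


def expected_regret(n: int, gap: float, delta: float) -> float:
    """n*gap*delta + log(1/delta)/gap, the bimodal model with constants set to one."""
    return n * gap * delta + math.log(1.0 / delta) / gap


def failure_second_moment(n: int, gap: float, delta: float) -> float:
    """Contribution of the failure event to the second moment: delta * (n*gap)^2."""
    return delta * (n * gap) ** 2


n, gap = 10**6, 0.1
for label, delta in (("1/(n gap^2)", 1.0 / (n * gap**2)),
                     ("1/(n gap)^2", 1.0 / (n * gap) ** 2)):
    print(f"delta = {label}: expected regret ~ {expected_regret(n, gap, delta):.1f}, "
          f"failure second moment ~ {failure_second_moment(n, gap, delta):.3g}")
```

In this toy model the more cautious choice of $\delta$ roughly doubles the expected regret but reduces the failure contribution to the second moment from order $n$ to order one.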


Notes 
1 MOSS is quite close to asymptotically optimal. You can prove that

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{4}{\Delta_i}.$$

By modifying the algorithm slightly, it is even possible to replace the four
with a two and recover the optimal asymptotic regret. The trick is to increase
g slightly and replace the four in the exploration bonus by two. The major
task is then to re-prove Lemma 9.3, which is done by replacing the intervals
$[2^j, 2^{j+1}]$ with smaller intervals $[\xi^j, \xi^{j+1}]$, where $\xi$ is tuned subsequently to be
fractionally larger than one. This procedure is explained in detail by Garivier
[2013]. When the reward distributions are actually Gaussian, there is a more
elegant technique that avoids peeling altogether (Exercise 9.4).


2 One way to mitigate the issues raised in Section 9.2 is to replace the index
used by MOSS with a less aggressive confidence level:

$$\hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)}\log^+\left(\frac{n}{T_i(t-1)}\right)}. \tag{9.2}$$

The resulting algorithm is never worse than UCB, and you will show in
Exercise 9.3 that it has a distribution-free regret of $O(\sqrt{nk\log(k)})$. An
algorithm that does almost the same thing in disguise is called ‘improved
UCB’, which operates in phases and eliminates arms for which the upper
confidence bound drops below a lower confidence bound for some arm [Auer
and Ortner, 2010]. This algorithm was the topic of Exercise 6.8.

3 Overcoming the failure of MOSS to be instance optimal without sacrificing
minimax optimality is possible by using an adaptive confidence level that tunes
the amount of optimism to match the instance. One of the authors has proposed
two ways to do this, using one of the following indices:

$$\hat\mu_i(t-1) + \sqrt{\frac{2(1+\varepsilon)}{T_i(t-1)}\log\left(\frac{n}{t}\right)}, \quad\text{or} \tag{9.3}$$

$$\hat\mu_i(t-1) + \sqrt{\frac{2}{T_i(t-1)}\log\left(\frac{n}{\sum_{j=1}^k \min\left\{T_i(t-1), \sqrt{T_i(t-1)T_j(t-1)}\right\}}\right)}.$$

The first of these algorithms is called the ‘optimally confident UCB’ [Lattimore,
2015b] while the second is AdaUCB [Lattimore, 2018]. Both algorithms are
minimax optimal up to constant factors and never worse than UCB. The
latter is also asymptotically optimal. If the horizon is unknown, then AdaUCB
can be modified by replacing $n$ with $t$. It remains a challenge to provide a
straightforward analysis for these algorithms.


Bibliographic Remarks 


MOSS is due to Audibert and Bubeck [2009], while an anytime modification 
is by Degenne and Perchet [2016]. The proof that a modified version of MOSS 
is asymptotically optimal may be found in the article by Ménard and Garivier 
[2017]. There is also a variant of MOSS that adapts to the variance for rewards 
bounded in [0,1] [Mukherjee et al., 2018]. AdaUCB and its friends are by one of 
the authors [Lattimore, 2015b, 2016b, 2018]. The idea to modify the confidence 
level has been seen in several places, with the earliest by Lai [1987] and more 
recently by Honda and Takemura [2010]. Kaufmann [2018] also used a confidence 
level like in Eq. (9.2) to derive an algorithm based on Bayesian upper confidence 
bounds. 


Exercises 
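9.1 (SUBMARTINGALE PROPERTY) Let $X_1, X_2, \ldots, X_n$ be adapted to filtration
$\mathbb{F} = (\mathcal{F}_t)_t$ with $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0$ almost surely. Prove that $M_t = \exp(\lambda\sum_{s=1}^t X_s)$
is an $\mathbb{F}$-submartingale for any $\lambda \in \mathbb{R}$.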


9.2 (PROBLEM-DEPENDENT BOUND) Let $\Delta_{\min} = \min_{i:\Delta_i>0} \Delta_i$. Show there exists
a universal constant $C > 0$ such that the regret of MOSS is bounded by

$$R_n \le \frac{Ck}{\Delta_{\min}}\log^+\left(\frac{n\Delta_{\min}^2}{k}\right) + \sum_{i=1}^k \Delta_i.$$


9.3 (UCB*) Suppose we modify the index used by MOSS to be

$$\hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)}\log^+\left(\frac{n}{T_i(t-1)}\right)}.$$

(a) Show that for all 1-subgaussian bandits, this new policy suffers regret at
most

$$R_n \le C\sum_{i:\Delta_i>0} \left(\Delta_i + \frac{1}{\Delta_i}\log^+\left(n\Delta_i^2\right)\right),$$

where $C > 0$ is a universal constant.

(b) Under the same conditions as the previous part, show there exists a universal
constant $C > 0$ such that

$$R_n \le C\sqrt{kn\log(k)} + \sum_{i=1}^k \Delta_i.$$


(c) Repeat parts (a) and (b) using the index 


9.4 (GAUSSIAN NOISE AND THE TANGENT APPROXIMATION) Let $g(t) = at + b$
with $b > 0$ and

$$u(x,t) = \frac{1}{\sqrt{2\pi t}}\exp\left(-\frac{x^2}{2t}\right) - \frac{1}{\sqrt{2\pi t}}\exp\left(-2ab - \frac{(x-2b)^2}{2t}\right).$$

(a) Show that $u(x,t) > 0$ for $x \in (-\infty, g(t))$ and $u(x,t) = 0$ for $x = g(t)$.
(b) Show that $u(x,t)$ satisfies the heat equation:

$$\partial_t u(x,t) = \frac{1}{2}\partial_x^2 u(x,t).$$

(c) Let $B_t$ be a standard Brownian motion, which for any fixed $t$ has density
with respect to the Lebesgue measure

$$p(x,t) = \frac{1}{\sqrt{2\pi t}}\exp\left(-\frac{x^2}{2t}\right).$$

Define $\tau = \min\{t : B_t = g(t)\}$ as the first time the Brownian motion hits the
boundary. Put on your physicist’s hat (or work hard) to argue that

$$\mathbb{P}(\tau > t) = \int_{-\infty}^{g(t)} u(x,t)\,dx.$$

(d) Let $v(t)$ be the density of the time $\tau$ with respect to the Lebesgue measure so
that $\mathbb{P}(\tau \le t) = \int_0^t v(s)\,ds$. Show that

$$v(t) = \frac{b}{\sqrt{2\pi t^3}}\exp\left(-\frac{g(t)^2}{2t}\right).$$


(e) In the last part, you established the exact density of the hitting time of a
Brownian motion approaching a linear boundary. We now generalise this
to nonlinear boundaries, but at the cost that now we only have a bound.
Suppose that $f : [0,\infty) \to [0,\infty)$ is concave and differentiable, and let
$\lambda : \mathbb{R} \to \mathbb{R}$ be the intersection of the tangent to $f$ at $t$ with the y-axis given
by $\lambda(t) = f(t) - tf'(t)$. Let $\tau = \min\{t : B_t = f(t)\}$ and $v(t)$ be the density
of $\tau$. Show that for $t > 0$,

$$v(t) \le \frac{\lambda(t)}{\sqrt{2\pi t^3}}\exp\left(-\frac{f(t)^2}{2t}\right).$$

(f) Suppose that $X_1, X_2, \ldots$ is a sequence of independent standard Gaussian
random variables. Show that

$$\mathbb{P}\left(\text{exists } t \le n : \sum_{s=1}^t X_s \ge f(t)\right) \le \int_0^n \frac{\lambda(t)}{\sqrt{2\pi t^3}}\exp\left(-\frac{f(t)^2}{2t}\right) dt.$$


(g) Let $h : (0,\infty) \to (1,\infty)$ be a concave increasing function such that
$\sqrt{\log(h(a))}/h(a) \le c/a$ for some constant $c > 0$ and $f(t) = \sqrt{2t\log h(1/(t\delta))} + t\Delta$.
Show that

$$\mathbb{P}\left(\text{exists } t : \sum_{s=1}^t X_s \ge f(t)\right) \le \frac{2c\delta}{\sqrt{\pi}\Delta^2}.$$

(h) Show that $h(a) = 1 + (1+a)\sqrt{\log(1+a)}$ satisfies the requirements of the
previous part with $c = 11/10$.

(i) Use your results to modify MOSS for the case when the rewards are Gaussian. 
Compare the algorithms empirically. 

(j) Prove for your modified algorithm that 


HINT The above exercise has several challenging components and assumes 
prior knowledge of Brownian motion and its interpretation in terms of the heat 
equation. We recommend the book by Lerche [1986] as a nice reference on hitting 
times for Brownian motion against concave barriers. The equation you derived in
Part (d) is called the Bachelier-Lévy formula, and the technique for doing
so is the method of images. The use of this theory in bandits was introduced
by one of the authors [Lattimore, 2018], which readers might find useful when 
working through these questions. 


9.5 (ASYMPTOTIC OPTIMALITY AND SUBGAUSSIAN NOISE) In the last exercise,
you modified MOSS to show asymptotic optimality when the noise is Gaussian.
This is also possible for subgaussian noise. Follow the advice in the notes of this
chapter to adapt MOSS so that for all 1-subgaussian bandits, it holds that

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{2}{\Delta_i},$$

while maintaining the property that $R_n \le C\sqrt{kn}$ for a universal constant $C > 0$.


10 The Upper Confidence Bound Algorithm: Bernoulli Noise


In previous chapters we assumed that the noise of the rewards was $\sigma$-subgaussian
for some known $\sigma > 0$. This has the advantage of simplicity and relative generality,
but stronger assumptions are sometimes justified and often lead to stronger results.
In this chapter the rewards are assumed to be Bernoulli, which just means that
$X_t \in \{0,1\}$. This is a fundamental setting found in many applications. For
example, in click-through prediction, the user either clicks on the link or not. A
Bernoulli bandit is characterised by the mean pay-off vector $\mu \in [0,1]^k$ and the
reward observed in round $t$ is $X_t \sim \mathcal{B}(\mu_{A_t})$.

The Bernoulli distribution is 1/2-subgaussian regardless of its mean 
(Exercise 5.12). Hence the results of the previous chapters are applicable, and an 
appropriately tuned UCB enjoys logarithmic regret. The additional knowledge 
that the rewards are Bernoulli is not being fully exploited by these algorithms, 
however. The reason is essentially that the variance of a Bernoulli random 
variable depends on its mean, and when the variance is small, the empirical mean 
concentrates faster, a fact that should be used to make the confidence intervals 
smaller. 


Concentration for Sums of Bernoulli Random Variables 


The first step when designing a new optimistic algorithm is to construct confidence 
sets for the unknown parameters. For Bernoulli bandits, this corresponds to 
analysing the concentration of the empirical mean for sums of Bernoulli random 
variables. For this, the following definition will prove useful: 


DEFINITION 10.1 (Relative entropy between Bernoulli distributions). The
relative entropy between Bernoulli distributions with parameters $p, q \in [0,1]$ is

$$d(p,q) = p\log(p/q) + (1-p)\log((1-p)/(1-q)),$$

where singularities are defined by taking limits: $d(0,q) = \log(1/(1-q))$ and
$d(1,q) = \log(1/q)$ for $q \in [0,1]$, and $d(p,0) = 0$ if $p = 0$ and $\infty$ otherwise, and
$d(p,1) = 0$ if $p = 1$ and $\infty$ otherwise.
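The definition, including its boundary conventions, can be written as a short function. This is a minimal sketch with a hypothetical name, `bernoulli_kl`; the helper `term` encodes the limits $0\log 0 = 0$ and $a\log(a/0) = \infty$ for $a > 0$, which together reproduce the four singular cases listed above.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("p and q must lie in [0, 1]")

    def term(a: float, b: float) -> float:
        # a * log(a / b), with the limiting conventions at zero
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b)

    return term(p, q) + term(1.0 - p, 1.0 - q)


print(bernoulli_kl(0.75, 0.5))   # strictly positive because p != q
print(bernoulli_kl(0.0, 0.25))   # equals log(1 / (1 - 0.25))
print(bernoulli_kl(0.3, 0.0))    # infinite
```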


More generally, the relative entropy or Kullback–Leibler divergence
is a measure of similarity between distributions. See Chapter 14 for a generic
definition, interpretation and discussion.


LEMMA 10.2. Let $p, q, \varepsilon \in [0,1]$. The following hold:


(a) The functions $d(\cdot, q)$ and $d(p, \cdot)$ are convex and have unique minimisers at $q$
and $p$, respectively.

(b) $d(p,q) \ge 2(p-q)^2$ (Pinsker’s inequality).

(c) If $p \le q - \varepsilon \le q$, then $d(p, q-\varepsilon) \le d(p,q) - d(q-\varepsilon, q) \le d(p,q) - 2\varepsilon^2$.


Proof We assume that $p, q \in (0,1)$. The corner cases are easily checked
separately. Part (a): $d(\cdot, q)$ is the sum of the negative binary entropy function
$h(p) = p\log p + (1-p)\log(1-p)$ and a linear function. The second derivative
of $h$ is $h''(p) = 1/p + 1/(1-p)$, which is positive, and hence $h$ is convex. For
fixed $p$ the function $d(p, \cdot)$ is the sum of $h(p)$ and convex functions $p\log(1/q)$ and
$(1-p)\log(1/(1-q))$. Hence $d(p, \cdot)$ is convex. The minimiser property follows
because $d(p,q) > 0$ unless $p = q$, in which case $d(p,p) = d(q,q) = 0$. A more
general version of (b) is given in Chapter 15. A proof of the simple version here
follows by considering the function $g(x) = d(p, p+x) - 2x^2$, which obviously
satisfies $g(0) = 0$. The proof is finished by showing that this is the unique
minimiser of $g$ over the interval $[-p, 1-p]$. The details are left to Exercise 10.1.
For (c), notice that

$$h(p) = d(p, q-\varepsilon) - d(p,q) = p\log\frac{q}{q-\varepsilon} + (1-p)\log\frac{1-q}{1-q+\varepsilon}.$$

It is easy to see then that $h$ is linear and increasing in its argument. Therefore,
since $p \le q - \varepsilon$,

$$h(p) \le h(q-\varepsilon) = -d(q-\varepsilon, q),$$

as required for the first inequality of (c). The second inequality follows by using
the result in (b).


The next lemma controls the concentration of the sample mean of a sequence 
of independent and identically distributed Bernoulli random variables. 


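LEMMA 10.3 (Chernoff’s bound). Let $X_1, X_2, \ldots, X_n$ be a sequence of independent
random variables that are Bernoulli distributed with mean $\mu$, and let $\hat\mu =
\frac{1}{n}\sum_{t=1}^n X_t$ be the sample mean. Then, for $\varepsilon \in [0, 1-\mu]$, it holds that

$$\mathbb{P}(\hat\mu \ge \mu + \varepsilon) \le \exp(-n\,d(\mu+\varepsilon, \mu)) \tag{10.1}$$

and for $\varepsilon \in [0, \mu]$,

$$\mathbb{P}(\hat\mu \le \mu - \varepsilon) \le \exp(-n\,d(\mu-\varepsilon, \mu)). \tag{10.2}$$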
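Proof We will again use the Cramér–Chernoff method. Let $\lambda > 0$ be some
constant to be chosen later. Then,

$$\begin{aligned}
\mathbb{P}(\hat\mu \ge \mu + \varepsilon) &= \mathbb{P}\left(\exp\left(\lambda\sum_{t=1}^n (X_t - \mu)\right) \ge \exp(\lambda n\varepsilon)\right) \\
&\le \frac{\mathbb{E}\left[\exp\left(\lambda\sum_{t=1}^n (X_t - \mu)\right)\right]}{\exp(\lambda n\varepsilon)} \\
&= \left(\mu\exp(\lambda(1-\mu-\varepsilon)) + (1-\mu)\exp(-\lambda(\mu+\varepsilon))\right)^n.
\end{aligned}$$

This expression is minimised by $\lambda = \log\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}$. Therefore,

$$\begin{aligned}
\mathbb{P}(\hat\mu \ge \mu + \varepsilon)
&\le \left(\mu\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{1-\mu-\varepsilon} + (1-\mu)\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{-\mu-\varepsilon}\right)^n \\
&= \left(\frac{\mu}{\mu+\varepsilon}\left(\frac{(\mu+\varepsilon)(1-\mu)}{\mu(1-\mu-\varepsilon)}\right)^{1-\mu-\varepsilon}\right)^n \\
&= \exp(-n\,d(\mu+\varepsilon, \mu)).
\end{aligned}$$

The bound on the left tail is proven identically.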


Using Pinsker’s inequality, it follows that $\mathbb{P}(\hat\mu \ge \mu + \varepsilon), \mathbb{P}(\hat\mu \le \mu - \varepsilon) \le
\exp(-2n\varepsilon^2)$, which is the same as what can be obtained from Hoeffding’s lemma
(see (5.8)). Solving $\exp(-2n\varepsilon^2) = \delta$, we recover the usual $1-\delta$ confidence upper
bound. In fact, this cannot be improved when $\mu \approx 1/2$, but the Chernoff bound
is much stronger when $\mu$ is close to either zero or one. Can we invert the Chernoff
tail bound to get confidence intervals that get tighter automatically as $\mu$ (or $\hat\mu$)
approaches zero or one? The following corollary shows how to do this.


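COROLLARY 10.4. Let $\mu, \hat\mu, n$ be as above. Then, for any $a \ge 0$,

$$\mathbb{P}(d(\hat\mu, \mu) \ge a, \hat\mu \le \mu) \le \exp(-na), \tag{10.3}$$

$$\text{and}\quad \mathbb{P}(d(\hat\mu, \mu) \ge a, \hat\mu \ge \mu) \le \exp(-na). \tag{10.4}$$

Furthermore, define

$$U(a) = \max\{u \in [0,1] : d(\hat\mu, u) \le a\} \quad\text{and}\quad L(a) = \min\{u \in [0,1] : d(\hat\mu, u) \le a\}.$$

Then, $\mathbb{P}(\mu \ge U(a)) \le \exp(-na)$ and $\mathbb{P}(\mu \le L(a)) \le \exp(-na)$.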


Proof First, we prove (10.3). Note that $d(\cdot, \mu)$ is decreasing on $[0, \mu]$, and thus,
for $0 \le a \le d(0, \mu)$, $\{d(\hat\mu, \mu) \ge a, \hat\mu \le \mu\} = \{\hat\mu \le \mu - x, \hat\mu \le \mu\} = \{\hat\mu \le \mu - x\}$,
where $x$ is the unique solution to $d(\mu - x, \mu) = a$ on $[0, \mu]$. Hence, by Eq. (10.2)
of Lemma 10.3, $\mathbb{P}(d(\hat\mu, \mu) \ge a, \hat\mu \le \mu) \le \exp(-na)$. When $a \ge d(0, \mu)$, the
inequality trivially holds. The proof of (10.4) is entirely analogous and hence
is omitted. For the second part of the corollary, fix $a$ and let $U = U(a)$.
First, notice that $U \ge \hat\mu$ and $d(\hat\mu, \cdot)$ is strictly increasing on $[\hat\mu, 1]$. Hence,
$\{\mu \ge U\} = \{\mu \ge U, \mu \ge \hat\mu\} = \{d(\hat\mu, \mu) \ge d(\hat\mu, U), \mu \ge \hat\mu\} = \{d(\hat\mu, \mu) \ge a, \mu \ge \hat\mu\}$,
where the last equality follows by $d(\hat\mu, U) = a$, which holds by the definition
of $U$. Taking probabilities and using the first part of the corollary shows that
$\mathbb{P}(\mu \ge U) \le \exp(-na)$. The statement concerning $L = L(a)$ follows with a similar
reasoning.


Note that for $\delta \in (0,1)$, $U = U(\log(1/\delta)/n)$ and $L = L(\log(1/\delta)/n)$ are upper
and lower confidence bounds for $\mu$. Although the relative entropy has no closed-
form inverse, the optimisation problem that defines $U$ and $L$ can be solved to a
high degree of accuracy using Newton’s method (the relative entropy $d$ is convex
in its second argument). The advantage of this confidence interval relative to
the one derived from Hoeffding’s bound is now clear. As $\hat\mu$ approaches one, the
width of the interval $U(a) - \hat\mu$ approaches zero, whereas the width of the interval
provided by Hoeffding’s bound stays at $\sqrt{\log(1/\delta)/(2n)}$. The same holds for
$\hat\mu - L(a)$ as $\hat\mu \to 0$.
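A simple way to compute $U(a)$ and $L(a)$ is sketched below. It uses bisection rather than Newton's method, which is slower but needs only the monotonicity of $d(\hat\mu, \cdot)$ on each side of $\hat\mu$; the function names and the tolerance are assumptions made for illustration. The example values $\hat\mu = 3/4$, $n = 10$ and $\delta = 0.1$ are chosen so that the Hoeffding upper bound exceeds one.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli distributions, with limits at 0 and 1."""
    def term(a: float, b: float) -> float:
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def kl_upper(mu_hat: float, a: float, tol: float = 1e-10) -> float:
    """U(a): largest u in [mu_hat, 1] with d(mu_hat, u) <= a, found by bisection."""
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mu_hat, mid) <= a:
            lo = mid
        else:
            hi = mid
    return lo


def kl_lower(mu_hat: float, a: float, tol: float = 1e-10) -> float:
    """L(a): smallest u in [0, mu_hat] with d(mu_hat, u) <= a, found by bisection."""
    lo, hi = 0.0, mu_hat
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mu_hat, mid) <= a:
            hi = mid
        else:
            lo = mid
    return hi


mu_hat, n, delta = 0.75, 10, 0.1
a = math.log(1.0 / delta) / n
width = math.sqrt(math.log(1.0 / delta) / (2 * n))
print("KL interval:       ", kl_lower(mu_hat, a), kl_upper(mu_hat, a))
print("Hoeffding interval:", mu_hat - width, mu_hat + width)
```

The KL interval stays inside $[0,1]$ and is asymmetric about $\hat\mu$, with the shorter side facing the nearer boundary.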


EXAMPLE 10.5. Fig. 10.1 shows a plot of $d(3/4, x)$ and the lower bound given
by Pinsker’s inequality. The approximation degrades as $|x - 3/4|$ grows large,
especially for $x > 3/4$. As explained in Corollary 10.4, the graph of $d(\hat\mu, \cdot)$ can
be used to derive confidence bounds by solving for $d(\hat\mu, x) = a = \log(1/\delta)/n$.
Assuming $\hat\mu = 3/4$ is observed with $n = 10$, a confidence level of 90 per cent
corresponds to $a \approx 0.23$. The confidence interval can be read out from the figure by finding
those values where the horizontal dashed black line intersects the solid blue line.
The resulting confidence interval will be highly asymmetric. Note that in this
scenario, the lower confidence bounds produced by both Hoeffding’s inequality
and Chernoff’s bound are similar, while the upper bound provided by Hoeffding’s
bound is vacuous.


Figure 10.1 Relative entropy and Pinsker’s inequality (the dashed horizontal line marks $a = 0.23$)
The KL-UCB Algorithm


The difference between KL-UCB and UCB is that Chernoff’s bound is used to
define the upper confidence bound instead of Corollary 5.5.


1: Input $k$
2: Choose each arm once
3: Subsequently choose

$$A_t = \operatorname{argmax}_i \max\left\{\tilde\mu \in [0,1] : d(\hat\mu_i(t-1), \tilde\mu) \le \frac{\log f(t)}{T_i(t-1)}\right\},$$

where $f(t) = 1 + t\log^2(t)$.


Algorithm 8: KL-UCB. 
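The arm-selection step can be sketched as follows. The inner maximisation is solved here by a fixed number of bisection steps, which is an implementation choice made for this illustration and is not prescribed by the algorithm; the names `klucb_index` and `choose_arm` are hypothetical, and ties are broken by the lowest arm index.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """Relative entropy d(p, q) between Bernoulli distributions, with limits at 0 and 1."""
    def term(a: float, b: float) -> float:
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def f(t: int) -> float:
    """f(t) = 1 + t log^2(t)."""
    return 1.0 + t * math.log(t) ** 2


def klucb_index(mean: float, pulls: int, t: int, steps: int = 50) -> float:
    """Largest value in [mean, 1] whose divergence from `mean` is at most log f(t) / pulls."""
    level = math.log(f(t)) / pulls
    lo, hi = mean, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo


def choose_arm(means, pulls, t: int) -> int:
    """Arm with the largest KL-UCB index in round t (every arm already played once)."""
    return max(range(len(means)), key=lambda i: klucb_index(means[i], pulls[i], t))


print(choose_arm([0.5, 0.5, 0.9], [10, 100, 100], t=211))
```

Because the search is restricted to $[\hat\mu_i(t-1), 1]$, the index never exceeds one, unlike the index of UCB.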


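THEOREM 10.6. If the reward in round $t$ is $X_t \sim \mathcal{B}(\mu_{A_t})$, then the regret of
Algorithm 8 is bounded by

$$R_n \le \sum_{i:\Delta_i>0} \inf_{\substack{\varepsilon_1, \varepsilon_2 > 0 \\ \varepsilon_1+\varepsilon_2 \in (0,\Delta_i)}} \Delta_i\left(\frac{\log f(n)}{d(\mu_i + \varepsilon_1, \mu^* - \varepsilon_2)} + \frac{1}{2\varepsilon_1^2} + \frac{2}{\varepsilon_2^2}\right).$$

Furthermore,

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{\Delta_i}{d(\mu_i, \mu^*)}.$$

It is instructive to compare the regret in Theorem 10.6 with what would be obtained
when using UCB from Chapter 8, which for subgaussian constant $\sigma = 1/2$ satisfies

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{1}{2\Delta_i}.$$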
By Pinsker’s inequality (part (b) of Lemma 10.2) we see that $d(\mu_i, \mu^*) \ge
2(\mu^* - \mu_i)^2 = 2\Delta_i^2$, which means that the asymptotic regret of KL-UCB is
never worse than that of UCB. On the other hand, a Taylor expansion shows
that when $\mu_i$ and $\mu^*$ are close (the hard case in the asymptotic regime),

$$d(\mu_i, \mu^*) = \frac{\Delta_i^2}{2\mu_i(1-\mu_i)} + o(\Delta_i^2),$$

indicating that the regret of KL-UCB is approximately

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \approx \sum_{i:\Delta_i>0} \frac{2\mu_i(1-\mu_i)}{\Delta_i}. \tag{10.5}$$

Notice that $\mu_i(1-\mu_i)$ is the variance of a Bernoulli distribution with mean $\mu_i$.
The approximation indicates that KL-UCB will improve on UCB in regimes
where $\mu_i$ is close to zero or one.
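The size of the improvement is easy to tabulate. The sketch below computes, for a single suboptimal arm, the asymptotic constant of KL-UCB, the constant of UCB with $\sigma = 1/2$, and the variance-based approximation in (10.5). The function name and the example means are illustrative choices, and the means are assumed to lie strictly inside $(0,1)$.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """Relative entropy d(p, q) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def asymptotic_constants(mu_i: float, mu_star: float):
    """Return (KL-UCB constant, UCB constant, approximation (10.5)) for one arm."""
    gap = mu_star - mu_i
    kl_ucb = gap / bernoulli_kl(mu_i, mu_star)
    ucb = 1.0 / (2.0 * gap)
    approx = 2.0 * mu_i * (1.0 - mu_i) / gap
    return kl_ucb, ucb, approx


for mu_i, mu_star in ((0.45, 0.5), (0.05, 0.1), (0.90, 0.95)):
    kl_ucb, ucb, approx = asymptotic_constants(mu_i, mu_star)
    print(f"mu_i = {mu_i}, mu* = {mu_star}: KL-UCB {kl_ucb:.2f}, "
          f"UCB {ucb:.2f}, approximation {approx:.2f}")
```

Near $1/2$ the two constants nearly coincide, while close to zero or one the KL-UCB constant is several times smaller.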

The proof of Theorem 10.6 relies on two lemmas. The first is used to show
that the index of the optimal arm is never too far below its true value, while the
second shows that the index of any other arm is not often much larger than the
same value. These results mirror those given for UCB, but things are complicated
by the non-symmetric and hard-to-invert divergence function.

For the next results, we define $\underline{d}(p,q) = d(p,q)\,\mathbb{I}\{p \le q\}$.


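LEMMA 10.7. Let $X_1, X_2, \ldots, X_n$ be independent Bernoulli random variables with
mean $\mu \in [0,1]$, $\varepsilon > 0$ and

$$\tau = \min\left\{t : \max_{1\le s\le n} \underline{d}(\hat\mu_s, \mu - \varepsilon) - \frac{\log f(t)}{s} \le 0\right\}.$$

Then, $\mathbb{E}[\tau] \le \dfrac{2}{\varepsilon^2}$.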


M 


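Proof We start with a high-probability bound and then integrate to control the
expectation.

$$\begin{aligned}
\mathbb{P}(\tau > t) &\le \mathbb{P}\left(\exists\, 1 \le s \le n : \underline{d}(\hat\mu_s, \mu - \varepsilon) > \frac{\log f(t)}{s}\right) \\
&\le \sum_{s=1}^n \mathbb{P}\left(\underline{d}(\hat\mu_s, \mu - \varepsilon) > \frac{\log f(t)}{s}\right) \\
&= \sum_{s=1}^n \mathbb{P}\left(d(\hat\mu_s, \mu - \varepsilon) > \frac{\log f(t)}{s},\ \hat\mu_s < \mu - \varepsilon\right) \\
&\le \sum_{s=1}^n \mathbb{P}\left(d(\hat\mu_s, \mu) > \frac{\log f(t)}{s} + 2\varepsilon^2,\ \hat\mu_s < \mu\right) && \text{((c) of Lemma 10.2)} \\
&\le \sum_{s=1}^n \exp\left(-s\left(2\varepsilon^2 + \frac{\log f(t)}{s}\right)\right) && \text{(Eq. (10.3) of Corollary 10.4)} \\
&\le \frac{1}{2f(t)\varepsilon^2}.
\end{aligned}$$

To finish, we integrate the tail,

$$\mathbb{E}[\tau] \le \int_0^\infty \mathbb{P}(\tau \ge t)\,dt \le \frac{1}{2\varepsilon^2}\int_0^\infty \frac{dt}{f(t)} \le \frac{2}{\varepsilon^2}.$$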


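LEMMA 10.8. Let $X_1, X_2, \ldots, X_n$ be independent Bernoulli random variables with
mean $\mu$. Further, let $\Delta > 0$, $a > 0$ and define

$$\kappa = \sum_{s=1}^n \mathbb{I}\left\{ d(\hat\mu_s, \mu+\Delta) \le \frac{a}{s} \right\}.$$

Then, $\displaystyle \mathbb{E}[\kappa] \le \inf_{\varepsilon\in(0,\Delta)} \left( \frac{a}{d(\mu+\varepsilon, \mu+\Delta)} + \frac{1}{2\varepsilon^2} \right)$.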
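Proof Let $\varepsilon \in (0,\Delta)$ and $u = a/d(\mu+\varepsilon, \mu+\Delta)$. Then,

$$\begin{aligned}
\mathbb{E}[\kappa] &= \sum_{s=1}^n \mathbb{P}\left(d(\hat\mu_s, \mu+\Delta) \le \frac{a}{s}\right) \\
&\le \sum_{s=1}^n \mathbb{P}\left(\hat\mu_s \ge \mu+\varepsilon \ \text{ or } \ d(\mu+\varepsilon, \mu+\Delta) \le \frac{a}{s}\right) && (d(\cdot, \mu+\Delta) \text{ is decreasing on } [0, \mu+\Delta]) \\
&\le u + \sum_{s=1}^n \exp\left(-s\, d(\mu+\varepsilon, \mu)\right) && \text{(Lemma 10.3)} \\
&\le \frac{a}{d(\mu+\varepsilon, \mu+\Delta)} + \frac{1}{d(\mu+\varepsilon, \mu)} \\
&\le \frac{a}{d(\mu+\varepsilon, \mu+\Delta)} + \frac{1}{2\varepsilon^2} && \text{(Pinsker's inequality/Lemma 10.2(b))}
\end{aligned}$$

as required.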


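Proof of Theorem 10.6 As in other proofs, we assume without loss of generality
that $\mu_1 = \mu^*$ and bound $\mathbb{E}[T_i(n)]$ for suboptimal arms $i$. To this end, fix a
suboptimal arm $i$ and let $\varepsilon_1 + \varepsilon_2 \in (0, \Delta_i)$ with both $\varepsilon_1$ and $\varepsilon_2$ positive. Define

$$\tau = \min\left\{ t : \max_{1\le s\le n} \underline{d}(\hat\mu_{1s}, \mu_1-\varepsilon_2) - \frac{\log f(t)}{s} \le 0 \right\} \quad\text{and}\quad
\kappa = \sum_{s=1}^n \mathbb{I}\left\{ d(\hat\mu_{is}, \mu_i+\Delta_i-\varepsilon_2) \le \frac{\log f(n)}{s} \right\}.$$

Using a similar argument as in the proof of Theorem 8.1,

$$\begin{aligned}
\mathbb{E}[T_i(n)] &= \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = i\}\right] \\
&\le \mathbb{E}[\tau] + \mathbb{E}\left[\sum_{t=\tau+1}^n \mathbb{I}\{A_t = i\}\right] \\
&\le \mathbb{E}[\tau] + \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\left\{A_t = i \text{ and } d(\hat\mu_{i,T_i(t-1)}, \mu_1-\varepsilon_2) \le \frac{\log f(t)}{T_i(t-1)}\right\}\right] \\
&\le \mathbb{E}[\tau] + \mathbb{E}[\kappa] \\
&\le \frac{2}{\varepsilon_2^2} + \frac{\log f(n)}{d(\mu_i+\varepsilon_1, \mu^*-\varepsilon_2)} + \frac{1}{2\varepsilon_1^2},
\end{aligned}$$

where the second inequality follows, since by the definition of $\tau$, if $t > \tau$, then
the index of the optimal arm is at least as large as $\mu_1 - \varepsilon_2$. The third inequality
follows from the definition of $\kappa$ as in the proof of Theorem 8.1. The final inequality
follows from Lemmas 10.7 and 10.8. The first claim of the theorem is completed
by substituting the above into the standard regret decomposition

$$R_n = \sum_{i=1}^k \Delta_i\, \mathbb{E}[T_i(n)].$$

The asymptotic claim is left for you in Exercise 10.2.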


Notes 


1 The new concentration inequality (Lemma 10.3) holds more generally for
any sequence of independent and identically distributed random variables
$X_1, X_2, \ldots, X_n$ for which $X_t \in [0,1]$ almost surely. Therefore all results in
this section also hold if the assumption that the noise is Bernoulli is relaxed
to the case where it is simply supported in $[0,1]$ (or other bounded sets by
shifting/scaling).

2 Expanding on the previous note, all that is required is a bound on the moment-
generating function for random variables $X$ where $X \in [0,1]$ almost surely.
Garivier and Cappé [2011, Lemma 9] noted that $f(x) = \exp(\lambda x) - x(\exp(\lambda) - 1) - 1$
is non-positive on $[0,1]$, and so

$$\mathbb{E}[\exp(\lambda X)] \le \mathbb{E}[X(\exp(\lambda) - 1) + 1] = \mu\exp(\lambda) + 1 - \mu,$$

which is precisely the moment-generating function of the Bernoulli distribution
with mean $\mu$. Then the remainder of the proof of Lemma 10.3 goes through
unchanged. This shows that for any bandit $\nu = (P_i)_i$ with $\operatorname{Supp}(P_i) \subseteq [0,1]$ for
all $i$ the regret of the policy in Algorithm 8 satisfies

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{\Delta_i}{d(\mu_i,\mu^*)}.$$

3 The bounds obtained using the argument in the previous note are not quite
tight. Specifically, one can show there exists an algorithm such that for all
bandits $\nu = (P_i)_i$ with $P_i$, the reward distribution of the $i$th arm, supported on
$[0,1]$,

$$\limsup_{n\to\infty} \frac{R_n}{\log(n)} = \sum_{i:\Delta_i>0} \frac{\Delta_i}{d_i}, \quad\text{where}$$

$$d_i = \inf\{D(P_i, P) : \mu(P) > \mu^* \text{ and } \operatorname{Supp}(P) \subset [0,1]\}$$

and $D(P, Q)$ is the relative entropy between measures $P$ and $Q$, which we
define in Chapter 14. The quantity $d_i$ is never smaller than $d(\mu_i, \mu^*)$. For details
on this, see the paper by Honda and Takemura [2010].

4 The approximation in Eq. (10.5) was used to show that the regret for KL-UCB
is closely related to the variance of the Bernoulli distribution. It is natural to
ask whether or not this result could be derived, at least asymptotically, by
appealing to the central limit theorem. The answer is no. First, the quality of
the approximation in Eq. (10.5) does not depend on $n$, so asymptotically it is
not true that the Bernoulli bandit behaves like a Gaussian bandit with variances
tuned to match. The reason is that as $n$ tends to infinity, the confidence level
should be chosen so that the risk of failure also tends to zero. But the central
limit theorem does not provide information about the tails with probability
mass less than $O(n^{-1/2})$. See Note 1 in Chapter 5.

5 The analysis in this chapter is easily generalised to a wide range of alternative 
noise models. You will do this for single-parameter exponential families in 
Exercises 10.4, 10.5 and 34.5. 

6 Chernoff credits Lemma 10.3 to his friend Herman Rubin [Chernoff, 2014], but 
the name seems to have stuck. 
optimal and asymptotically optimal for single-parameter exponential families.

Exercises


10.1 (PINSKER’S INEQUALITY) Prove Lemma 10.2(b). 


Hint Consider the function $g(x) = d(p, p + x) - 2x^2$ over the $[-p, 1-p]$ interval.
By taking derivatives, show that $g \ge 0$.


10.2 (ASYMPTOTIC OPTIMALITY) Prove the asymptotic claim in Theorem 10.6. 
HINT Choose $\varepsilon_1, \varepsilon_2$ to decrease slowly with $n$ and use the first part of the
theorem. 


10.3 (CONCENTRATION FOR BOUNDED RANDOM VARIABLES) Let $\mathbb{F} = (\mathcal{F}_t)_t$ be a
filtration and $(X_t)_t$ be a $[0,1]$-valued, $\mathbb{F}$-adapted sequence such that $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = \mu_t$
for some non-random numbers $\mu_1, \ldots, \mu_n \in [0,1]$. Define $\mu = \frac{1}{n}\sum_{t=1}^n \mu_t$ and
$\hat\mu = \frac{1}{n}\sum_{t=1}^n X_t$. Prove that the conclusion of Lemma 10.3 still holds.


Hint Read Note 2 at the end of this chapter. Let $g(\cdot, \mu)$ be the cumulant-
generating function of the $\mu$-parameter Bernoulli distribution. For $X \sim \mathcal{B}(\mu)$ and
$\lambda \in \mathbb{R}$, $g(\lambda, \mu) = \log \mathbb{E}[\exp(\lambda X)]$. Show that $g(\lambda, \cdot)$ is concave. Next, use this and
the tower rule to show that $\mathbb{E}[\exp(\lambda n(\hat\mu - \mu))] \le g(\lambda, \mu)^n$.


The bound of the previous exercise is most useful when the $\mu_t$ are either all
close to zero or all close to one. When half of the $\{\mu_t\}$ are close to
zero and the other half close to one, then the bound degrades to Hoeffding's
bound.


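10.4 (KL-UCB FOR EXPONENTIAL FAMILIES) Let $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ be a
regular non-singular exponential family with sufficient statistic $S(x) = x$ and
$\mathcal{E} = \{(P_{\theta_i})_{i=1}^k : \theta \in \Theta^k\}$ be the set of bandits with reward distributions in $\mathcal{M}$.
Design a policy $\pi$ such that for all $\nu \in \mathcal{E}$, it holds that

$$\limsup_{n\to\infty} \frac{R_n(\pi, \nu)}{\log(n)} \le \sum_{i:\Delta_i>0} \frac{\Delta_i}{d_{i,\inf}},$$

where $\mu(\theta) = \int_{\mathbb{R}} x\, dP_\theta(x)$ is the mean of $P_\theta$ and $d_{i,\inf} = \inf\{d(\theta_i, \phi) : \mu(\phi) >
\mu^*, \phi \in \Theta\}$, with $d(\theta, \phi)$ the relative entropy between $P_\theta$ and $P_\phi$.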


Hint Readers not familiar with exponential families should skip ahead to 
Section 34.3.1 and then do Exercise 34.5. For the exercise, repeat the proof of 
Theorem 10.6, adapting as necessary. See also the paper by Cappé et al. [2013]. 


10.5 (KL-UCB FOR NON-CANONICAL EXPONENTIAL FAMILIES) Repeat the 
previous exercise, but relax the assumption that S(x) = x. 


HINT This is a subtle problem. You should adapt the algorithm so that if there 
are ties in the upper confidence bounds, then an arm with the largest number of 
plays is chosen. A solution is available. Korda et al. [2013] analysed Thompson 
sampling in this setting. Their result only holds when $\theta \mapsto \int_{\mathbb{R}} x\, p_\theta(x)\, dh(x)$ is
invertible, which does not always hold.

In the analysis of KL-UCB for canonical exponential families, the asymptotic
rate is a good indicator of the finite-time regret in the sense that the o(log(n)) 
term hidden by the asymptotics has roughly the same leading constant as 
the dominant term. By contrast, the analysis here indicates that 


[a(n] = SEW) 


SS SS 


di ing hs scotia 


rh 


where di min = di,min(0). Although the latter term is negligible asymptotically, 
it may be the dominant term for all reasonable n. 


10.6 (COMPARISON TO UCB) In this exercise, you compare KL-UCB and UCB 
empirically. 


(a) Implement Algorithm 8 and Algorithm 6, where the latter algorithm should
be tuned for 1/2-subgaussian bandits so that

$$A_t = \operatorname{argmax}_{i\in[k]}\ \hat\mu_i(t-1) + \sqrt{\frac{\log(f(t))}{2T_i(t-1)}}.$$


(b) Let $n = 10000$ and $k = 2$. Plot the expected regret of each algorithm as a
function of $\Delta$ when $\mu_1 = 1/2$ and $\mu_2 = 1/2 + \Delta$.

(c) Repeat the above experiment with pı = 1/10 and u = 9/10. 

(d) Discuss your results. 


Part III 


Adversarial Bandits with 
Finitely Many Arms 
Statistician George E. P. Box is famous for writing that ‘all models are wrong,
but some are useful’. In the stochastic bandit model the reward is sampled from 
a distribution that depends only on the chosen action. It does not take much 
thought to realise this model is almost always wrong. At the macroscopic level 
typically considered in bandit problems, there is not much that is stochastic 
about the world. And even if there were, it is hard to rule out the existence of 
other factors influencing the rewards. 

The quotation suggests we should not care whether or not the stochastic bandit 
model is right, only whether it is useful. In science, models are used for predicting 
the outcomes of future experiments, and their usefulness is measured by the 
quality of the predictions. But how can this be applied to bandit problems? What 
predictions can be made based on bandit models? In this respect, we postulate 
the following: 


The point of bandit models is to facilitate predicting the performance of 
bandit algorithms on future problem instances that one encounters in their 
practice. 


A model can fail in two fundamentally different ways. It can be too specific, 
imposing assumptions so detached from reality that a catastrophic mismatch 
between actual and predicted performance may arise. The second mode of failure 
occurs when a model is too general, which makes the algorithms designed to do 
well on the bandit model overly cautious, which can harm performance. 

Not all assumptions are equally important. It is a critical assumption in 
stochastic bandits that the mean reward of individual arms does not change 
(significantly) over time. On the other hand, the assumption that a single, arm- 
dependent distribution generates the rewards for a given arm plays a relatively 
insignificant role. The reader is encouraged to think of cases when the constancy 
of arm distributions plays no role, and also of cases when it does — furthermore, to 
decide to what extent the algorithms can tolerate deviations from the assumption 
that the means of arms stay the same. Stochastic bandits where the means of 
the arms are changing over time are called non-stationary and are the topic of 
Chapter 31. 

If a highly specialised model is actually correct, then the resulting algorithms 
usually dominate algorithms derived for a more general model. This is a general 
manifestation of the bias-variance trade-off, well known in supervised learning 
and statistics. The holy grail is to find algorithms that work ‘optimally’ across 
a range of models. The reader should think about examples from the previous 
chapters that illustrate these points. 

The usefulness of the stochastic model depends on the setting. In particular, 
the designer of the bandit algorithm must carefully evaluate whether stochasticity, 
stability of the mean and independence are reasonable assumptions. For some 
applications, the answer will probably be yes, while in others the practitioner 
may seek something more robust. This latter situation is the topic of the next
few chapters. 


Adversarial Bandits 


The adversarial bandit model abandons almost all the assumptions on how 
the rewards are generated, so much so that the environment is often called the 
adversary. The adversary has a great deal of power in this model, including the 
ability to examine the code of the proposed algorithms and choose the rewards 
accordingly. All that is kept from the previous chapters is that the objective will 
be framed in terms of how well a policy is able to compete with the best action 
in hindsight. 

At first sight, it seems remarkable that one can say anything at all about such 
a general model. And yet it turns out that this model is not much harder than 
the stochastic bandit problem. Why this holds and how to design algorithms that 
achieve these guarantees will be explained in the following chapters. 

To give you a glimmer of hope, imagine playing the following simple bandit 
game with a friend. The horizon is n = 1, and you have two actions. The game 
proceeds as follows: 


1 You tell your friend your strategy for choosing an action. 

2 Your friend secretly chooses rewards $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$.

3 You implement your strategy to select $A \in \{1,2\}$ and receive reward $x_A$.
4 The regret is $R = \max\{x_1, x_2\} - x_A$.


Clearly, if your friend chooses $x_1 = x_2$, then your regret is zero no matter what.
Now let’s suppose you implement the deterministic strategy $A = 1$. Then your
friend can choose $x_1 = 0$ and $x_2 = 1$, and your regret is $R = 1$. The trick to
improve on this is to randomise. If you tell your friend, ‘I will choose $A = 1$ with
probability one half’, then the best she can do is choose $x_1 = 1$ and $x_2 = 0$ (or
reversed), and your expected regret is $R = 1/2$. You are forgiven if you did not
settle on this solution yourself because we did not tell you that a strategy may 
be randomised. With such a short horizon, you cannot do better than this, but 
for longer games the relative advantage of the adversary decreases, as we shall 
see soon. 
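The one-round game is small enough to check by enumeration. The sketch below computes the worst-case expected regret of the strategy that plays action 1 with probability `p`, by maximising over the four reward pairs the friend can choose. The function names are illustrative.

```python
from itertools import product

def expected_regret(p, x1, x2):
    """Expected regret when action 1 is played with probability p."""
    best = max(x1, x2)
    return p * (best - x1) + (1 - p) * (best - x2)

def worst_case_regret(p):
    """Maximum expected regret over all reward pairs in {0, 1}^2."""
    return max(expected_regret(p, x1, x2)
               for x1, x2 in product((0, 1), repeat=2))

# Scan a grid of strategies; the worst case equals max(p, 1 - p).
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, worst_case_regret(p))
```

The worst case is $\max\{p, 1-p\}$, which is minimised by the uniform strategy $p = 1/2$.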

In the next two chapters, we investigate the k-armed adversarial model in detail, 
providing both algorithms and regret analysis. Like the stochastic model, the 
adversarial model has many generalisations, which we’ll visit in future chapters. 


Bibliographic Remarks 


The quote by George Box was used several times with different phrasings [Box, 
1976, 1979]. The adversarial framework has its roots in game theory, with familiar 
names like Hannan [1957] and Blackwell [1954] producing some of the early
work. The non-statistical approach has enjoyed enormous popularity since the 
1990’s and has been adopted wholeheartedly by the theoretical computer science 
community [Vovk, 1990, Littlestone and Warmuth, 1994, and many, many others]. 
The earliest work on adversarial bandits is by Auer et al. [1995]. There is now a 
big literature on adversarial bandits, which we will cover in more depth in the 
chapters that follow. There has been a lot of effort to move away from stochastic 
assumptions. An important aspect of this is to define a sense of regularity for 
individual sequences. We refer the reader to some of the classic papers by Martin- 
Löf [1966] and Levin [1973] and the more recent paper by Ivanenko and Labkovsky 
[2013]. 
11.1


The Exp3 Algorithm 


In this chapter we first introduce the formal model of adversarial bandit 
environments and discuss the relationship to the stochastic bandit model. This is 
followed by the discussion of importance-weighted estimation, the Exp3 algorithm 
that uses this technique and the analysis of the regret of Exp3. 


Adversarial Bandit Environments 


Let $k > 1$ be the number of arms. A $k$-armed adversarial bandit is an arbitrary
sequence of reward vectors $(x_t)_{t=1}^n$, where $x_t \in [0,1]^k$. In each round, the learner
chooses a distribution over the actions $P_t \in \mathcal{P}_{k-1}$. Then the action $A_t \in [k]$ is
sampled from $P_t$, and the learner receives reward $x_{tA_t}$. The interaction protocol is
summarised in Fig. 11.2.

Figure 11.1 Would you play with this multi-armed bandit?


Adversary secretly chooses rewards $(x_t)_{t=1}^n$ with $x_t \in [0,1]^k$
For rounds $t = 1, 2, \ldots, n$:
    Learner selects distribution $P_t \in \mathcal{P}_{k-1}$ and samples $A_t$ from $P_t$.
    Learner observes reward $X_t = x_{tA_t}$.


Figure 11.2 Interaction protocol for k-armed adversarial bandits.

A policy in this setting is a function $\pi : ([k] \times [0,1])^* \to \mathcal{P}_{k-1}$ mapping history
sequences to distributions over actions (regardless of measurability). The
performance of a policy $\pi$ in environment $x$ is measured by the expected regret,
which is the expected loss in revenue of the policy relative to the best fixed action
in hindsight.

$$R_n(\pi, x) = \max_{i\in[k]} \sum_{t=1}^n x_{ti} - \mathbb{E}\left[\sum_{t=1}^n x_{tA_t}\right], \tag{11.1}$$

where the expectation is over the randomness of the learner’s actions. The
arguments $\pi$ and $x$ are omitted from the regret when they are clear from context.


The only source of randomness in the regret comes from the randomness in 
the actions of the learner. Of course the interaction with the environment 
means the action chosen in round t may depend on actions s < t as well as 
the observed rewards until round t. As we noted, unlike the case of stochastic 
bandits, here, there is no measurability restriction on the learner’s policy $\pi$.
This is actually by choice, see Note 12 for details. 


The worst-case regret over all environments is

$$R_n^*(\pi) = \sup_{x\in[0,1]^{n\times k}} R_n(\pi, x).$$

The main question is whether or not there exist policies $\pi$ for which $R_n^*(\pi)$ is
sublinear in $n$. In Exercise 11.2 you will show that for deterministic policies
$R_n^*(\pi) \ge n(1 - 1/k)$, which follows by constructing a bandit so that $x_{tA_t} = 0$ for
all $t$ and $x_{ti} = 1$ for $i \ne A_t$. Because of this, sublinear worst-case regret is only
possible by using a randomised policy.
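The construction against deterministic policies can be sketched in code. Because the policy is deterministic, the adversary can simulate it in advance and assign reward zero to whichever arm will be played and reward one to all others. The policy `least_played` below is a hypothetical example, and the representation of a policy as a function of the history and `k` is an assumption made for this illustration.

```python
def adversarial_rewards(policy, n, k):
    """Simulate a deterministic policy; the played arm gets 0, the rest 1."""
    history = []  # pairs (action, observed reward)
    rewards = []
    for _ in range(n):
        action = policy(history, k)
        rewards.append([0.0 if i == action else 1.0 for i in range(k)])
        history.append((action, 0.0))  # the learner always observes reward 0
    return rewards, [a for a, _ in history]

def regret(rewards, actions):
    """Regret of the action sequence relative to the best fixed arm."""
    k = len(rewards[0])
    best = max(sum(row[i] for row in rewards) for i in range(k))
    return best - sum(row[a] for row, a in zip(rewards, actions))

def least_played(history, k):
    """Hypothetical deterministic policy: play the arm used least so far."""
    counts = [0] * k
    for action, _ in history:
        counts[action] += 1
    return counts.index(min(counts))

n, k = 100, 4
table, played = adversarial_rewards(least_played, n, k)
print(regret(table, played), n * (1 - 1 / k))
```

The learner collects no reward at all, while some arm is played at most $n/k$ times and therefore accumulates at least $n(1 - 1/k)$.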


Readers familiar with game theory will not be surprised by the need for 
randomisation. The interaction between learner and adversarial bandit can be 
framed as a two-player zero-sum game between the learner and environment. 
The moves for the environment are the possible reward sequences, and for 
the player they are the policies. The pay-off for the environment /learner is 
the regret and its negation respectively. Since the player goes first, the only 
way to avoid being exploited is to choose a randomised policy. 


While stochastic and adversarial bandits seem quite different, it turns out that the 
optimal worst-case regret is the same up to constant factors and that lower bounds 
for adversarial bandits are invariably derived in the same manner as for stochastic 
bandits (see Part IV). In this chapter, we present a simple algorithm for which 
the worst-case regret is suboptimal by just a logarithmic factor. First, however, 
we explore the differences and similarities between stochastic and adversarial 
environments. 

We already noted that deterministic strategies will have linear regret for 
some adversarial bandit. Since strategies in Part II like UCB and ‘Explore-then- 
Commit’ were deterministic, they are not well suited for the adversarial setting. 
This immediately implies that policies that are good for stochastic bandit can 
be very suboptimal in the adversarial setting. What about the other direction? 
Will an adversarial bandit strategy have small expected regret in the stochastic 
setting? Let $\pi$ be an adversarial bandit policy and $\nu = (\nu_1, \ldots, \nu_k)$ be a stochastic
bandit with $\operatorname{Supp}(\nu_i) \subseteq [0,1]$ for all $i$. Next, let $X_{ti}$ be sampled from $\nu_i$ for each
$i \in [k]$ and $t \in [n]$, and assume these random variables are mutually independent.
By Jensen’s inequality and convexity of the maximum function, we have

$$\begin{aligned}
R_n(\pi, \nu) &= \max_{i\in[k]} \mathbb{E}\left[\sum_{t=1}^n (X_{ti} - X_{tA_t})\right] \\
&\le \mathbb{E}\left[\max_{i\in[k]} \sum_{t=1}^n (X_{ti} - X_{tA_t})\right] \\
&= \mathbb{E}[R_n(\pi, X)] \le R_n^*(\pi),
\end{aligned} \tag{11.2}$$


where the regret in the first line is the stochastic regret (using the random table 
model), and in the last it is the adversarial regret. Therefore the worst-case 
stochastic regret is upper-bounded by the worst-case adversarial regret. Going 
the other way, the above inequality also implies that the worst-case regret for 
adversarial problems is lower-bounded by the worst-case regret on stochastic 
problems with rewards bounded in $[0,1]$. In Chapter 15, we prove that the worst-case
regret for stochastic Bernoulli bandits is at least $c\sqrt{nk}$, where $c > 0$ is a
universal constant (Exercise 15.4). And so for the same universal constant, the
minimax regret for adversarial bandits satisfies

$$R_n^* = \inf_{\pi} \sup_{x\in[0,1]^{n\times k}} R_n(\pi, x) \ge c\sqrt{nk}.$$

There is a little subtlety here. In order to define the expectations in the stochastic 
regret, the policy should be appropriately measurable. This can be resolved by 
noting that lower bounds can be proven using Bernoulli bandits. For details, see 
again Note 12. 


Importance-Weighted Estimators 


A key ingredient of all adversarial bandit algorithms is a mechanism for estimating
the reward of unplayed arms. Recall that $P_t$ is the conditional distribution of the
action played in round $t$, and so for $i \in [k]$, $P_{ti}$ is the conditional probability

$$P_{ti} = \mathbb{P}(A_t = i \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}).$$

In what follows, we assume that for all $t$ and $i$, $P_{ti} > 0$ almost surely. As we
shall see later, this will be true for all policies considered in this chapter. The
importance-weighted estimator of $x_{ti}$ is

$$\hat X_{ti} = \frac{\mathbb{I}\{A_t = i\}\, X_t}{P_{ti}}. \tag{11.3}$$

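Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, X_1, \ldots, A_t, X_t]$ denote the conditional expectation given the
history up to time $t$. The conditional mean of $\hat X_{ti}$ satisfies

$$\mathbb{E}_{t-1}[\hat X_{ti}] = x_{ti}, \tag{11.4}$$

which means that $\hat X_{ti}$ is an unbiased estimate of $x_{ti}$ conditioned on the history
observed after $t - 1$ rounds. To see why Eq. (11.4) holds, let $A_{ti} = \mathbb{I}\{A_t = i\}$ so
that $X_t A_{ti} = x_{ti} A_{ti}$ and

$$\hat X_{ti} = \frac{A_{ti}}{P_{ti}}\, x_{ti}.$$

Now, $\mathbb{E}_{t-1}[A_{ti}] = P_{ti}$, and since $P_{ti}$ is $\sigma(A_1, X_1, \ldots, A_{t-1}, X_{t-1})$-measurable,

$$\mathbb{E}_{t-1}[\hat X_{ti}] = \mathbb{E}_{t-1}\left[\frac{A_{ti}}{P_{ti}}\, x_{ti}\right] = \frac{x_{ti}}{P_{ti}}\, \mathbb{E}_{t-1}[A_{ti}] = \frac{x_{ti}}{P_{ti}}\, P_{ti} = x_{ti}.$$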

Being unbiased is a good start, but the variance of an estimator is also important.
For arbitrary random variable $U$, the conditional variance $\mathbb{V}_{t-1}[U]$ is the random
variable

$$\mathbb{V}_{t-1}[U] = \mathbb{E}_{t-1}\left[(U - \mathbb{E}_{t-1}[U])^2\right].$$

So $\mathbb{V}_{t-1}[\hat X_{ti}]$ is a random variable that measures the variance of $\hat X_{ti}$ conditioned
on the past. Calculating the conditional variance using the definition of $\hat X_{ti}$ and
Eq. (11.4) shows that

$$\mathbb{V}_{t-1}[\hat X_{ti}] = \mathbb{E}_{t-1}[\hat X_{ti}^2] - x_{ti}^2 = \mathbb{E}_{t-1}\left[\frac{A_{ti}\, x_{ti}^2}{P_{ti}^2}\right] - x_{ti}^2 = \frac{x_{ti}^2 (1 - P_{ti})}{P_{ti}}. \tag{11.5}$$

This can be extremely large when $P_{ti}$ is small and $x_{ti}$ is bounded away from zero.
In the notes and exercises, we shall see to what extent this can cause trouble.
The estimator in (11.3) is the first that comes to mind, but there are alternatives.
For example,

$$\hat X_{ti} = 1 - \frac{\mathbb{I}\{A_t = i\}}{P_{ti}}\,(1 - X_t). \tag{11.6}$$

This estimator is still unbiased. Rewriting the formula in terms of $y_{ti} = 1 - x_{ti}$
and $Y_t = 1 - X_t$ and $\hat Y_{ti} = 1 - \hat X_{ti}$ leads to

$$\hat Y_{ti} = \frac{\mathbb{I}\{A_t = i\}}{P_{ti}}\, Y_t.$$

This is the same as (11.3) except that $Y_t$ has replaced $X_t$. The terms $y_{ti}$, $Y_t$ and
$\hat Y_{ti}$ should be interpreted as losses. Had we started with losses to begin with, then
this would have been the estimator that first came to mind. For obvious reasons,
the estimator in Eq. (11.6) is called the loss-based importance-weighted
estimator. The conditional variance of this estimator is essentially the same as
Eq. (11.5):

$$\mathbb{V}_{t-1}[\hat X_{ti}] = \mathbb{V}_{t-1}[\hat Y_{ti}] = \frac{y_{ti}^2 (1 - P_{ti})}{P_{ti}}.$$

The only difference is that the variance now depends on $y_{ti}^2$ rather than $x_{ti}^2$. Which
is better depends on the rewards for arm 7, with smaller rewards suggesting the 
superiority of the first estimator and larger rewards (or small losses) suggesting 
the superiority of the second estimator. Can we change the estimator (either one 
of them) so that it is more accurate for actions whose reward is close to some
specific value $v$? Of course! Just change the estimator so that $v$ is subtracted
from the observed reward (or loss), then use the importance-sampling formula,
and subsequently add back $v$. The problem is that the optimal value of $v$ depends
on the unknown quantity being estimated. Also note that the dependence of the
variance on $P_{ti}$ is the same for both estimators, and since the rewards are bounded,
it is this term that usually contributes most significantly. In Exercise 11.5, we ask 
you to show that all unbiased estimators in this setting are importance-weighted 
estimators. 
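Unbiasedness and the two variance formulas can be verified exactly for a single round by enumerating the possible actions. In the sketch below the sampling distribution `probs` and the reward vector `rewards` are arbitrary values assumed for illustration; `reward_based` implements Eq. (11.3) and `loss_based` implements Eq. (11.6).

```python
def estimator_moments(estimate, probs, i):
    """Exact mean and variance of estimate(action, i) when action ~ probs."""
    mean = sum(p * estimate(a, i) for a, p in enumerate(probs))
    second = sum(p * estimate(a, i) ** 2 for a, p in enumerate(probs))
    return mean, second - mean ** 2

probs = [0.7, 0.2, 0.1]    # sampling distribution P_t (assumed values)
rewards = [0.3, 0.9, 0.5]  # reward vector x_t (assumed values)

def reward_based(a, i):
    """Estimator of Eq. (11.3): I{A_t = i} X_t / P_ti."""
    return rewards[a] / probs[i] if a == i else 0.0

def loss_based(a, i):
    """Estimator of Eq. (11.6): 1 - I{A_t = i} (1 - X_t) / P_ti."""
    return 1.0 - ((1.0 - rewards[a]) / probs[i] if a == i else 0.0)

for i in range(len(probs)):
    print(i, estimator_moments(reward_based, probs, i),
          estimator_moments(loss_based, probs, i))
```

Both estimators have mean $x_{ti}$; their variances are $x_{ti}^2(1-P_{ti})/P_{ti}$ and $y_{ti}^2(1-P_{ti})/P_{ti}$ respectively, so the arm with the smallest sampling probability has the noisiest estimate.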


Although the two estimators seem quite similar, it should be noted that the 
first estimator takes values in $[0, \infty)$ while the second takes values in $(-\infty, 1]$.
Soon we will see that this difference has a big impact on the usefulness of 
these estimators when used in the Exp3 algorithm. 


The Exp3 Algorithm 


The simplest algorithm for adversarial bandits is called Exp3, which stands 
for ‘exponential-weight algorithm for exploration and exploitation’. The reason 
for this name will become clear after the explanation of the algorithm. Let 
$\hat S_{ti} = \sum_{s=1}^t \hat X_{si}$ be the total estimated reward by the end of round $t$, where $\hat X_{si}$ is
given in Eq. (11.6). It seems natural to play actions with larger estimated reward
with higher probability. While there are many ways to map $\hat S_{ti}$ into probabilities,
a simple and popular choice is called exponential weighting, which for tuning
parameter $\eta > 0$ sets

$$P_{ti} = \frac{\exp(\eta \hat S_{t-1,i})}{\sum_{j=1}^k \exp(\eta \hat S_{t-1,j})}. \tag{11.7}$$

The parameter $\eta$ is called the learning rate. When the learning rate is large, $P_t$
concentrates about the arm with the largest estimated reward and the resulting
algorithm exploits aggressively. For small learning rates, $P_t$ is more uniform,
and the algorithm explores more frequently. Note that as $P_t$ concentrates, the
variance of the importance-weighted estimators for poorly performing arms
increases dramatically. There are many ways to tune the learning rate, including
allowing it to vary with time. In this chapter we restrict our attention to the
simplest case by choosing $\eta$ to depend only on the number of actions $k$ and the
horizon $n$. Since the algorithm depends on $\eta$, this means that the horizon must
be known in advance, a requirement that can be relaxed (see Note 10).
1: Input: $n$, $k$, $\eta$
2: Set $\hat S_{0i} = 0$ for all $i$
3: for $t = 1, \ldots, n$ do
4:     Calculate the sampling distribution $P_t$:
           $P_{ti} = \exp(\eta \hat S_{t-1,i}) \big/ \sum_{j=1}^k \exp(\eta \hat S_{t-1,j})$
5:     Sample $A_t \sim P_t$ and observe reward $X_t$
6:     Calculate $\hat S_{ti}$:
           $\hat S_{ti} = \hat S_{t-1,i} + 1 - \mathbb{I}\{A_t = i\}(1 - X_t) / P_{ti}$
7: end for


Algorithm 9: Exp3.
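A minimal Python sketch of Algorithm 9 follows. It assumes the rewards are supplied as an $n \times k$ table (a list of rows), which plays the role of the oblivious adversary; the function name `exp3`, the return values and the subtraction of the largest estimate before exponentiating (a standard device to avoid overflow that leaves $P_t$ unchanged) are choices made for this illustration.

```python
import math
import random

def exp3(rewards, eta, rng):
    """Run Exp3 with the loss-based estimator on an n-by-k reward table."""
    k = len(rewards[0])
    s_hat = [0.0] * k           # estimated cumulative rewards
    probs = [1.0 / k] * k
    actions, total = [], 0.0
    for row in rewards:
        # Exponential weights; shifting by the maximum does not change P_t.
        top = max(s_hat)
        weights = [math.exp(eta * (s - top)) for s in s_hat]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        action = rng.choices(range(k), weights=probs)[0]
        x = row[action]         # only the played arm's reward is observed
        actions.append(action)
        total += x
        for i in range(k):
            played = 1.0 if i == action else 0.0
            s_hat[i] += 1.0 - played * (1.0 - x) / probs[i]
    return actions, total, probs

n, k = 2000, 3
table = [[1.0, 0.0, 0.0] for _ in range(n)]  # arm 0 is always best
eta = math.sqrt(math.log(k) / (n * k))
actions, total, final_probs = exp3(table, eta, random.Random(0))
print("regret:", n - total, "final distribution:", final_probs)
```

Only the estimate of the played arm changes by something other than one in each round, so the update loop could be specialised; the explicit loop is kept here to mirror line 6 of the algorithm.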


Regret Analysis 


We are now ready to bound the expected regret of Exp3. 


THEOREM 11.1. Let $x \in [0,1]^{n\times k}$ and $\pi$ be the policy of Exp3 (Algorithm 9) with
learning rate $\eta = \sqrt{\log(k)/(nk)}$. Then,

$$R_n(\pi, x) \le 2\sqrt{nk\log(k)}.$$


As we will prove many variants of this result with various tools, here we give a 
short algebraic proof, saving the development of intuition for later. 


Proof For any arm $i$, define

$$R_{ni} = \sum_{t=1}^n x_{ti} - \mathbb{E}\left[\sum_{t=1}^n X_t\right],$$

which is the expected regret relative to using action $i$ in all the rounds. The
result will follow by bounding $R_{ni}$ for all $i$, including the optimal arm. For the
remainder of the proof, let $i$ be some fixed arm. By the unbiasedness property of
the importance-weighted estimator $\hat X_{ti}$,

$$\mathbb{E}[\hat S_{ni}] = \sum_{t=1}^n x_{ti} \quad\text{and also}\quad \mathbb{E}_{t-1}[X_t] = \sum_{j=1}^k P_{tj}\, x_{tj} = \sum_{j=1}^k P_{tj}\, \mathbb{E}_{t-1}[\hat X_{tj}]. \tag{11.8}$$

The tower rule says that for any random variable $X$, $\mathbb{E}[\mathbb{E}_{t-1}[X]] = \mathbb{E}[X]$, which
together with the linearity of expectation and Eq. (11.8) means that

$$R_{ni} = \mathbb{E}[\hat S_{ni}] - \mathbb{E}\left[\sum_{t=1}^n \sum_{j=1}^k P_{tj} \hat X_{tj}\right] = \mathbb{E}\left[\hat S_{ni} - \hat S_n\right], \tag{11.9}$$

where the last equality serves as the definition of $\hat S_n = \sum_{t=1}^n \sum_{j=1}^k P_{tj} \hat X_{tj}$. To bound
the right-hand side of Eq. (11.9), let

$$W_t = \sum_{j=1}^k \exp(\eta \hat S_{tj}).$$

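By convention an empty sum is zero, which means that $\hat S_{0j} = 0$ and $W_0 = k$.
Then,

$$\exp(\eta \hat S_{ni}) \le \sum_{j=1}^k \exp(\eta \hat S_{nj}) = W_n = W_0\, \frac{W_1}{W_0} \cdots \frac{W_n}{W_{n-1}} = k \prod_{t=1}^n \frac{W_t}{W_{t-1}}. \tag{11.10}$$

The ratio in the product can be rewritten in terms of $P_t$ by

$$\frac{W_t}{W_{t-1}} = \sum_{j=1}^k \frac{\exp(\eta \hat S_{t-1,j})}{W_{t-1}} \exp(\eta \hat X_{tj}) = \sum_{j=1}^k P_{tj} \exp(\eta \hat X_{tj}). \tag{11.11}$$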


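We need the following facts:

$$\exp(x) \le 1 + x + x^2 \ \text{ for all } x \le 1 \quad\text{and}\quad 1 + x \le \exp(x) \ \text{ for all } x \in \mathbb{R}.$$

Using these two inequalities leads to

$$\begin{aligned}
\frac{W_t}{W_{t-1}} &\le 1 + \eta \sum_{j=1}^k P_{tj} \hat X_{tj} + \eta^2 \sum_{j=1}^k P_{tj} \hat X_{tj}^2 \\
&\le \exp\left( \eta \sum_{j=1}^k P_{tj} \hat X_{tj} + \eta^2 \sum_{j=1}^k P_{tj} \hat X_{tj}^2 \right).
\end{aligned} \tag{11.12}$$

Notice that this was only possible because $\hat X_{tj}$ is defined by Eq. (11.6), which
ensures that $\hat X_{tj} \le 1$ and would not have been true had we used Eq. (11.3).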
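Combining Eq. (11.12) and Eq. (11.10),

$$\exp(\eta \hat S_{ni}) \le k \exp\left( \eta \hat S_n + \eta^2 \sum_{t=1}^n \sum_{j=1}^k P_{tj} \hat X_{tj}^2 \right).$$

Taking the logarithm of both sides, dividing by $\eta > 0$ and reordering gives

$$\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \eta \sum_{t=1}^n \sum_{j=1}^k P_{tj} \hat X_{tj}^2. \tag{11.13}$$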


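As noted earlier, the expectation of the left-hand side is $R_{ni}$. The first term on
the right-hand side is a constant, which leaves us to bound the expectation of the
second term. Letting $y_{tj} = 1 - x_{tj}$ and $Y_t = 1 - X_t$ and expanding the definition
of $\hat X_{tj}^2$ leads to

$$\begin{aligned}
\mathbb{E}_{t-1}\left[\sum_{j=1}^k P_{tj} \hat X_{tj}^2\right]
&= \mathbb{E}_{t-1}\left[\sum_{j=1}^k P_{tj} \left(1 - \frac{\mathbb{I}\{A_t = j\}\, y_{tj}}{P_{tj}}\right)^2\right] \\
&= \mathbb{E}_{t-1}\left[\sum_{j=1}^k P_{tj} \left(1 - \frac{2\,\mathbb{I}\{A_t = j\}\, y_{tj}}{P_{tj}} + \frac{\mathbb{I}\{A_t = j\}\, y_{tj}^2}{P_{tj}^2}\right)\right] \\
&= \mathbb{E}_{t-1}\left[1 - 2Y_t + \sum_{j=1}^k \frac{\mathbb{I}\{A_t = j\}\, y_{tj}^2}{P_{tj}}\right] \\
&= \mathbb{E}_{t-1}\left[1 - 2Y_t + \sum_{j=1}^k y_{tj}^2\right] \\
&= \mathbb{E}_{t-1}\left[(1 - Y_t)^2 + \sum_{j \ne A_t} y_{tj}^2\right] \\
&\le k.
\end{aligned}$$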


Summing over $t$, and then substituting into Eq. (11.13), we get

$$R_{ni} \le \frac{\log(k)}{\eta} + \eta n k = 2\sqrt{nk\log(k)},$$

where the equality follows by substituting $\eta = \sqrt{\log(k)/(nk)}$, which was chosen
to optimise this bound.
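The choice of learning rate can be checked with a one-line calculation. The following worked step, which uses only the bound $\log(k)/\eta + \eta nk$ from the proof, minimises it over $\eta > 0$; the name $g$ is introduced here purely for illustration.

```latex
g(\eta) = \frac{\log(k)}{\eta} + \eta n k, \qquad
g'(\eta) = -\frac{\log(k)}{\eta^2} + nk = 0
\iff \eta = \sqrt{\frac{\log(k)}{nk}}, \qquad
g\!\left(\sqrt{\tfrac{\log(k)}{nk}}\right) = \sqrt{nk\log(k)} + \sqrt{nk\log(k)} = 2\sqrt{nk\log(k)}.
```

Since $g$ is convex on $(0, \infty)$, the stationary point is the global minimiser, and at this point the two terms of the bound are equal.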


At the heart of the proof are the inequalities:

$$1 + x \le \exp(x) \ \text{ for all } x \in \mathbb{R} \quad\text{and}\quad \exp(x) \le 1 + x + x^2 \ \text{ for } x \le 1.$$

The former of these inequalities is an ansatz derived from the first-order Taylor
expansion of $\exp(x)$ about $x = 0$. The latter, however, is not the second-order
Taylor expansion, which would be $1 + x + x^2/2$. The problem is that the second-
order Taylor series is not an upper bound on $\exp(x)$ for $x \le 1$, but only for
$x \le 0$:

$$\exp(x) \le 1 + x + \frac{1}{2}x^2 \ \text{ for all } x \le 0. \tag{11.14}$$


But it is nearly an upper bound, and this can be exploited to improve the bound 
in Theorem 11.1. The mentioned upper and lower bounds on exp(x) are shown 
in Fig. 11.3, from which it is quite obvious that the bound in Eq. (11.14) is 
significantly tighter when x < 0. 
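The three inequalities are easy to sanity-check numerically. The sketch below evaluates each bound on a finite grid; this is only a spot check on the chosen grid, not a proof, and the helper names, grid size and tolerance are arbitrary choices.

```python
import math

def quadratic_bound(x):
    """Upper bound 1 + x + x^2, claimed valid for x <= 1."""
    return 1.0 + x + x * x

def taylor_bound(x):
    """Second-order Taylor polynomial 1 + x + x^2 / 2, valid only for x <= 0."""
    return 1.0 + x + 0.5 * x * x

def holds_on_grid(bound, lo, hi, steps=2000, tol=1e-12):
    """Check exp(x) <= bound(x) at steps + 1 equally spaced points of [lo, hi]."""
    xs = (lo + (hi - lo) * j / steps for j in range(steps + 1))
    return all(math.exp(x) <= bound(x) + tol for x in xs)

print(holds_on_grid(quadratic_bound, -5.0, 1.0))  # claimed range x <= 1
print(holds_on_grid(taylor_bound, -5.0, 0.0))     # claimed range x <= 0
print(holds_on_grid(taylor_bound, 0.0, 1.0))      # outside the claimed range
```

The last check fails because $\exp(x) > 1 + x + x^2/2$ for every $x > 0$, which is why the proof of Theorem 11.1 uses the cruder bound $1 + x + x^2$.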

Let us now put Eq. (11.14) to use in proving the following improved version of
Theorem 11.1, for which the regret is smaller by a factor of $\sqrt{2}$. The algorithm is
unchanged except for a slightly increased learning rate.
Figure 11.3 Approximations for exp(x) on [−1/2, 1/2].


THEOREM 11.2. Let $x \in [0,1]^{n\times k}$ be an adversarial bandit and $\pi$ be the policy of
Exp3 with learning rate $\eta = \sqrt{2\log(k)/(nk)}$. Then,

$$R_n(\pi, x) \le \sqrt{2nk\log(k)}.$$


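Proof By construction, $\hat X_{tj} \le 1$. Therefore,

$$\exp(\eta \hat X_{tj}) = \exp(\eta) \exp\left(\eta(\hat X_{tj} - 1)\right)
\le \exp(\eta) \left\{ 1 + \eta(\hat X_{tj} - 1) + \frac{\eta^2}{2}(\hat X_{tj} - 1)^2 \right\}.$$

Using the fact that $\sum_j P_{tj} = 1$ and the inequality $1 + x \le \exp(x)$, we get

$$\frac{W_t}{W_{t-1}} = \sum_{j=1}^k P_{tj} \exp(\eta \hat X_{tj}) \le \exp\left( \eta \sum_{j=1}^k P_{tj} \hat X_{tj} + \frac{\eta^2}{2} \sum_{j=1}^k P_{tj} (\hat X_{tj} - 1)^2 \right),$$

where the equality is from Eq. (11.11). We see that here we need to bound
$\sum_j P_{tj}(\hat X_{tj} - 1)^2$. Let $\hat Y_{tj} = 1 - \hat X_{tj}$. Then,

$$P_{tj}(\hat X_{tj} - 1)^2 = P_{tj} \hat Y_{tj} \hat Y_{tj} = \mathbb{I}\{A_t = j\}\, y_{tj} \hat Y_{tj} \le \hat Y_{tj},$$

where the last inequality used $\hat Y_{tj} \ge 0$ and $y_{tj} \le 1$. Thus,

$$\sum_{j=1}^k P_{tj}(\hat X_{tj} - 1)^2 \le \sum_{j=1}^k \hat Y_{tj}.$$

With the same calculations as before, we get

$$\hat S_{ni} - \hat S_n \le \frac{\log(k)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \sum_{j=1}^k \hat Y_{tj}. \tag{11.15}$$

The result is completed by taking expectations of both sides, using
$\mathbb{E} \sum_{t,j} \hat Y_{tj} = \mathbb{E} \sum_{t,j} \mathbb{E}_{t-1} \hat Y_{tj} = \mathbb{E} \sum_{t,j} y_{tj} \le nk$ and substituting the learning rate.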


The reader may wonder about the somewhat ad hoc proof. The best we
can do for now is to point out a few things about the proof. It is natural to
replace the true rewards with the estimated ones. Then, to prove a regret
bound in terms of the estimated rewards, an alternative to the proof is
to start with the trivial inequality that states that for any vector $x = (x_j)$
and positive quantity $\eta$, the inequality $x_i \le \frac{1}{\eta} \log \sum_j \exp(\eta x_j)$ holds.
Applying this with $x = (\hat S_{nj})_j$ gives

$$\hat S_{ni} \le \frac{1}{\eta} \log\left( \sum_j \exp(\eta \hat S_{nj}) \right) = \frac{1}{\eta} \log(W_n),$$

from where the proof can be continued by introducing the telescoping
argument.


Notes 


1 Exp3 is nearly optimal in the sense that its expected regret cannot be improved
significantly in the worst case. The distribution of its regret, however, is very far
from optimal. Define the random regret to be the random variable measuring
the actual deficit of the learner relative to the best arm in hindsight:

$$\hat R_n = \underbrace{\max_{i\in[k]} \sum_{t=1}^n x_{ti} - \sum_{t=1}^n X_t}_{\text{in terms of rewards}} = \underbrace{\sum_{t=1}^n Y_t - \min_{i\in[k]} \sum_{t=1}^n y_{ti}}_{\text{in terms of losses}}.$$

In Exercise 11.6 you will show that for all large enough $n$ and reasonable
choices of $\eta$, there exists a bandit such that the random regret of Exp3 satisfies
$\mathbb{P}(\hat R_n \ge n/4) > 1/131$. In the same exercise, you should explain why this does
not contradict the upper bound. That Exp3 has such a high variance is a
serious limitation, which we address in the next chapter.

What happens when the range of the rewards is unbounded? This has been 
studied by Allenberg et al. [2006], where some (necessarily much weaker) 
positive results are presented. 

In the full information setting, the learner observes the whole vector
$x_t \in [0,1]^k$ at the end of round $t$, but the reward is still $x_{tA_t}$. This setting is
also called prediction with expert advice. Exponential weighting is still
a good idea, but the estimated rewards can now be replaced by the actual
rewards. The resulting algorithm is sometimes called Hedge or the exponential
weights algorithm. The proof as written goes through in almost the same way,
but one should replace the polynomial upper bound on $\exp(x)$ with Hoeffding’s
lemma. This analysis gives a regret of $\sqrt{n\log(k)/2}$, which is optimal in an
asymptotic sense [Cesa-Bianchi and Lugosi, 2006].
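The full information variant is simple enough to check numerically. The sketch below runs exponential weights on the true rewards and compares the expected regret with $\sqrt{n\log(k)/2}$. The learning rate $\eta = \sqrt{8\log(k)/n}$ is the standard tuning that yields this bound; it is an assumption here because the note does not state it, and the function name and random reward table are likewise illustrative.

```python
import numpy as np

def hedge_regret(x):
    """Expected regret of exponential weights with full information.

    x has shape (n, k) with rewards in [0, 1]; the learner's expected reward
    in round t is the inner product <P_t, x_t>.
    """
    n, k = x.shape
    eta = np.sqrt(8 * np.log(k) / n)     # assumed tuning for sqrt(n log(k) / 2)
    s = np.zeros(k)                      # cumulative true rewards
    expected_reward = 0.0
    for t in range(n):
        w = np.exp(eta * (s - s.max()))  # shift by the max for stability
        p = w / w.sum()
        expected_reward += p @ x[t]
        s += x[t]                        # the whole reward vector is observed
    return s.max() - expected_reward

rng = np.random.default_rng(1)
n, k = 1000, 5
x = rng.uniform(size=(n, k))             # hypothetical reward table
regret = hedge_regret(x)
bound = np.sqrt(n * np.log(k) / 2)
print(f"regret {regret:.2f}, bound {bound:.2f}")
```

Because the learner's reward is averaged over its own randomisation, the computed regret is deterministic given the reward table, which makes the comparison with the bound exact rather than statistical.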


We assumed that the adversary chooses the rewards at the start of the game. 
Such adversaries are called oblivious. An adversary is called reactive or 
non-oblivious if $x_t$ is allowed to depend on the history $x_1, A_1, \ldots, x_{t-1}, A_{t-1}$.
Despite the fact that this is clearly a harder problem, the result we obtained 
can be generalised to this setting without changing the analysis. It is another 
question whether the definition of regret makes sense for reactive environments. 


A more sophisticated algorithm and analysis shaves a factor of $\sqrt{\log(k)}$ from
the regret upper bound in Theorem 11.2 [Audibert and Bubeck, 2009, 2010a, 
Bubeck and Cesa-Bianchi, 2012]. It turns out that this algorithm, just like 
Exp3, is an instantiation of mirror descent from convex optimisation, which 
we present in Chapter 28. More details are in Exercise 28.15. Interestingly, 
this algorithm not only shaves off the extra $\sqrt{\log(k)}$ factor from the regret,
but also achieves $O(\log(n))$-regret in the stochastic setting provided that one
uses a learning rate of $1/\sqrt{t}$ in round $t$ [Zimmert and Seldin, 2019]. This
remarkable result improves in an elegant way on many previous attempts to 
design algorithms for stochastic and adversarial bandits [Bubeck and Slivkins, 
2012, Seldin and Slivkins, 2014, Auer and Chiang, 2016, Seldin and Lugosi, 
2017]. There are some complications, however, depending on whether or not the 
adversary is oblivious. The situation is best summarised by Auer and Chiang 
[2016], where the authors present upper and lower bounds on what is possible 
in various scenarios. 


The initial distribution (the ‘prior’) $P_1$ does not have to be uniform. By biasing
the prior towards a specific action, the regret can be reduced when the favoured 
action turns out to be optimal. There is an unavoidable price for this, however, 
if the optimal arm is not favoured [Lattimore, 2015a]. 


Building on the previous note, suppose the reward in round $t$ is $X_t =
f_t(A_1, \ldots, A_t)$ and $f_1, \ldots, f_n$ are a sequence of functions chosen in advance by
the adversary with $f_t : [k]^t \to [0,1]$. Let $\Pi \subset [k]^n$ be a set of action sequences.
Then the expected policy regret with respect to $\Pi$ is

$$\max_{a_1,\ldots,a_n \in \Pi}\sum_{t=1}^n f_t(a_1,\ldots,a_t) - \mathbb{E}\left[\sum_{t=1}^n f_t(A_1,\ldots,A_t)\right].$$

Even if $\Pi$ only consists of constant sequences, there still does not exist a policy
guaranteeing sublinear regret. The reason is simple. Consider the two candidate
choices of $f_1, \ldots, f_n$. In the first choice, $f_t(a_1,\ldots,a_t) = \mathbb{I}\{a_1 = 1\}$, and in
the second we have $f_t(a_1,\ldots,a_t) = \mathbb{I}\{a_1 = 2\}$. Clearly the learner must suffer
linear regret in at least one of these two reactive bandit environments. The
problem is that the learner’s decision in the first round determines the rewards
available in all subsequent rounds, and there is no time for learning. By making
additional assumptions, however, sublinear regret is possible, for example by
assuming the adversary has limited memory [Arora et al., 2012].

There is a common misconception that the adversarial framework is a good fit 
for non-stationary environments. While the framework does not assume the 
rewards are stationary, the regret concept used in this chapter has stationarity 
built in. A policy designed for minimising the regret relative to the best action 
in hindsight is seldom suitable for non-stationary bandits, where the whole point 
is to adapt to changes in the optimal arm. In such cases a better benchmark is 
to compete with a sequence of actions. For more on non-stationary bandits, 
see Chapter 31. 

The estimators in Eq. (11.3) and Eq. (11.6) both have conditional variance
$\mathbb{V}_t[\hat X_{ti}] \approx 1/P_{ti}$, which blows up for small $P_{ti}$. It is instructive to think about
whether and how $P_{ti}$ can take on very small values. Consider the loss-based
estimator given by (11.6). For this estimator, when $P_{tA_t}$ and $X_t$ are both
small, $\hat X_{tA_t}$ can take on a large negative value. Through the update formula
(11.7), this then translates into $P_{t+1,A_t}$ being squashed aggressively towards
zero. A similar issue arises with the reward-based estimator given by (11.3).
The difference is that now it will be a ‘positive surprise’ ($P_{tA_t}$ small, $X_t$
large) that pushes the probabilities towards zero. But note that in this case,
$P_{t+1,i}$ is pushed towards zero for all $i \ne A_t$. This means that dangerously
small probabilities are expected to be more frequent for the gains estimator
Eq. (11.3).
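The $1/P_{ti}$ scaling can be seen from the exact conditional variances. The sketch below assumes the reward-based estimator is $\mathbb{I}\{A_t = i\}X_t/P_{ti}$ and the loss-based estimator is $1 - \mathbb{I}\{A_t = i\}(1 - X_t)/P_{ti}$; both are a constant plus a scaled Bernoulli($P_{ti}$) variable, which gives the closed forms used in the code. The function name and the chosen reward value are illustrative.

```python
def estimator_variances(p, x):
    """Exact conditional variances of the two importance-weighted estimators
    for an arm played with probability p whose reward is x in [0, 1]."""
    # Reward-based: X_hat = I{A = i} x / p, a scaled Bernoulli(p) variable.
    var_reward = x**2 * (1 - p) / p
    # Loss-based: X_hat = 1 - I{A = i} (1 - x) / p.
    var_loss = (1 - x)**2 * (1 - p) / p
    return var_reward, var_loss

for p in [0.5, 0.1, 0.01, 0.001]:
    vr, vl = estimator_variances(p, x=0.9)
    print(f"p={p:<6} reward-based {vr:10.2f}   loss-based {vl:10.2f}")
```

Both variances grow like $1/p$ as $p$ shrinks; which one is larger for a given arm depends on whether its reward is close to one or close to zero.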

Exp3 requires advance knowledge of the horizon. The doubling trick can be 
used to overcome this issue, but a more elegant solution is to use a decreasing 
learning rate. The analysis in this chapter can be adapted to this case. More 
discussion is provided in the notes and exercises of Chapter 28, where we give 
a more generic solution to this problem (Exercise 28.13). 

The calculation in Eq. (11.2) is a reduction, showing that algorithms with low 
regret on finite-armed adversarial bandits also have low regret on stochastic 
bandits where the reward distributions have appropriately bounded support. 
Reductions play an important role throughout the bandit literature and we will 
see many more examples. The reader should be careful not to generalise the idea 
that adversarial algorithms work well on stochastic problems. The assumptions 
must be checked (like boundedness of the support), and for different models 
there can be subtleties. The whole of Chapter 29 is devoted to the linear case.

As we mentioned, a policy for $k$-armed adversarial bandits is defined by any
function $\pi : ([k] \times [0,1])^* \to \mathcal{P}_{k-1}$. There is no need to assume that $\pi$ is
measurable because the actions are discrete and the rewards are deterministic.
The relations between the stochastic and adversarial regret are only well defined
for policies that are probability kernels as defined in Definition 4.7. You might
be worried that lower bounds for stochastic bandits only imply lower bounds for
measurable adversarial policies. Fortunately, the lower bounds are easily proven
for Bernoulli bandits, and in this case the space of reward sequences is finite
and measurability is no longer problematic. Later we study adversarial bandits
with an infinite action set $\mathcal{A}$, which is equipped with a $\sigma$-algebra $\mathcal{G}$. In this
case the reward vectors are replaced by functions $(x_t)_{t=1}^n$, where $x_t : \mathcal{A} \to [0,1]$
is $\mathcal{G}$-measurable. Then, the measurability condition on the policy is that for all
choices of the adversary and all $B \in \mathcal{B}(\mathcal{A})$,

$$\pi\big(B \mid a_1, x_1(a_1), \ldots, a_{t-1}, x_{t-1}(a_{t-1})\big)$$

must be measurable as a function of $a_1, \ldots, a_{t-1}$. In practice, of course, all the
policies you might ever propose would also be measurable as a function of the
rewards.


Bibliographic Remarks 


Exponential weighting has been a standard tool in online learning since the 
papers by Vovk [1990] and Littlestone and Warmuth [1994]. Exp3 and several 
variations were introduced by Auer et al. [1995], which was also the first paper to 
study bandits in the adversarial framework. The algorithm and analysis presented 
here differs slightly because we do not add any additional exploration, while the 
version of Exp3 in that paper explores uniformly with low probability. The fact 
that additional exploration is not required was observed by Stoltz [2005]. 


Exercises 


11.1 (SAMPLING FROM A MULTINOMIAL) In order to implement Exp3, you need 
a way to sample from the exponential weights distribution. Many programming 
languages provide a standard way to do this. For example, in Python you can use 
the Numpy library and numpy.random.multinomial. In more basic languages, 
however, you only have access to a function rand() that returns a floating point 
number ‘uniformly’ distributed in [0, 1]. Describe an algorithm that takes as input
a probability vector $p \in \mathcal{P}_{k-1}$ and uses a single call to rand() to return $X \in [k]$
with $P(X = i) = p_i$.


On most computers, rand() will return a pseudo-random number, and since 
there are only finitely many floating point numbers, the resulting distribution 
will not really be uniform on [0,1]. Thinking about these issues is a worthy 
endeavour, and sometimes it really matters. For this exercise you may ignore 
these issues, however. 


11.2 (LINEAR REGRET FOR DETERMINISTIC POLICIES) Show that for any
deterministic policy $\pi$ there exists an environment $x \in [0,1]^{n \times k}$ such that
$R_n(\pi, x) \ge n(1 - 1/k)$. What does your result say about the policies designed in
Part II?


11.3 (MAXIMUM AND EXPECTATIONS) Show that the first inequality in (11.2)
holds: Moving the maximum inside the expectation increases the value of the
expectation.


11.4 (ALTERNATIVE REGRET DEFINITION) Suppose we had defined the regret by

$$R_n^{\text{track}}(\pi, x) = \mathbb{E}\left[\sum_{t=1}^n \max_{i\in[k]} x_{ti} - \sum_{t=1}^n x_{tA_t}\right].$$

At first sight this definition seems like the right thing because it measures what
you actually care about. Unfortunately, however, it gives the adversary too much
power. Show that for any policy $\pi$ (randomised or not), there exists an
$x \in [0,1]^{n \times k}$ such that

$$R_n^{\text{track}}(\pi, x) \ge n\left(1 - \frac{1}{k}\right).$$


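11.5 (UNBIASED ESTIMATORS ARE IMPORTANCE WEIGHTED) Let $P \in \mathcal{P}_{k-1}$
be a probability vector with nonzero components and let $A \sim P$. Suppose
$\hat X : [k] \times \mathbb{R} \to \mathbb{R}$ is a function such that for all $x \in \mathbb{R}^k$,

$$\mathbb{E}[\hat X(A, x_A)] = \sum_{i=1}^k P_i \hat X(i, x_i) = x_1.$$

Show that there exists an $a \in \mathbb{R}^k$ such that $\langle a, P\rangle = 0$ and for all $i$ and $z$ in their
respective domains, $\hat X(i, z) = a_i + \dfrac{\mathbb{I}\{i = 1\}\, z}{P_1}$.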


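11.6 (VARIANCE OF EXP3) In this exercise, you will show that if $\eta \in [n^{-p}, 1]$
for some $p \in (0,1)$, then for sufficiently large $n$, there exists a bandit on which
Exp3 has a constant probability of suffering linear regret. We work with losses so
that given a bandit $y \in [0,1]^{n \times k}$, the learner samples $A_t$ from $P_t$ given by

$$P_{ti} = \frac{\exp\left(-\eta\sum_{s=1}^{t-1}\hat Y_{si}\right)}{\sum_{j=1}^k \exp\left(-\eta\sum_{s=1}^{t-1}\hat Y_{sj}\right)},$$

where $\hat Y_{ti} = A_{ti}\, y_{ti}/P_{ti}$. Let $\alpha \in [1/4, 1/2]$ be a constant to be tuned subsequently
and define a two-armed adversarial bandit in terms of its losses by

$$y_{t1} = \begin{cases} 0 & \text{if } t \le n/2 \\ 1 & \text{otherwise} \end{cases}
\qquad\text{and}\qquad
y_{t2} = \begin{cases} \alpha & \text{if } t \le n/2 \\ 0 & \text{otherwise.} \end{cases}$$

For simplicity you may assume that $n$ is even.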


(a) Define the sequence of real-valued functions $q_0, \ldots, q_n$ on domain $[1/4, 1/2]$
inductively by $q_0(\alpha) = 1/2$ and

$$q_{s+1}(\alpha) = \frac{q_s(\alpha)\exp(-\eta\alpha/q_s(\alpha))}{1 - q_s(\alpha) + q_s(\alpha)\exp(-\eta\alpha/q_s(\alpha))}.$$

Show for $t \le 1 + n/2$ that $P_{t2} = q_{T_2(t-1)}(\alpha)$, where $T_2(t) = \sum_{s=1}^t A_{s2}$.

(b) Show that for sufficiently large $n$ there exists an $\alpha \in [1/4, 1/2]$ and $s \in \mathbb{N}$
such that

$$q_s(\alpha) = \frac{1}{8n} \qquad\text{and}\qquad \sum_{u=0}^{s-1}\frac{1}{q_u(\alpha)} \le \frac{n}{8}.$$

(c) Prove that $P(T_2(n/2) \ge s + 1) \ge 1/65$.

(d) Prove that $P(\hat R_n \ge n/4) \ge (1 - n\exp(-\eta n)/2)/65$.

(e) The previous part shows that the regret is linear with constant probability
for sufficiently large $n$. On the other hand, a dubious application of Markov’s
inequality and Theorem 11.1 shows that

$$P(\hat R_n \ge n/4) \le \frac{4\,\mathbb{E}[\hat R_n]}{n}.$$

Explain the apparent contradiction.

(f) Validate the theoretical results of this exercise in an experimental fashion:
Implement Exp3 with the loss sequence suggested to reproduce Fig. 11.4.
The learning rate is set to the value computed in Theorem 11.2: $\eta =
\sqrt{2\log(k)/(nk)}$. Compare the figure with the theoretical results: Is there an
agreement between theory and the empirical results?


Figure 11.4 Exp3 instability: Box and whisker plot of the distribution of the regret of
Exp3 for different values of $\alpha$ over a horizon of $n = 10^4$ with $m = 500$ repetitions for the
example of Exercise 11.6. The boxes represent the quartiles of the empirical distribution,
the diamond shows the average; the median is equal to the upper quartile (and thus
cannot be seen), while the dots show values outside of the “interquartile range”.
11.7 (GUMBEL TRICK) Let $a_1, \ldots, a_k$ be positive real values and $U_1, \ldots, U_k$ be
a sequence of independent and identically distributed uniform random variables
on $[0,1]$. Then let $G_i = -\log(-\log(U_i))$, which follows a standard Gumbel
distribution. Prove that

$$P\left(\log(a_i) + G_i = \max_{j\in[k]}\big(\log(a_j) + G_j\big)\right) = \frac{a_i}{\sum_{j=1}^k a_j}.$$
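Before attempting the proof, it can help to see the identity hold empirically. The following Monte Carlo sketch draws Gumbel noise from uniforms exactly as in the exercise statement and compares the frequency with which each index attains the maximum with $a_i/\sum_j a_j$. The values of `a`, the sample size and the seed are arbitrary choices made for illustration, and the simulation is of course not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 5.0])             # arbitrary positive values
m = 200_000                               # number of Monte Carlo samples

u = rng.uniform(size=(m, a.size))         # U_i, uniform on [0, 1)
g = -np.log(-np.log(u))                   # standard Gumbel noise G_i
winners = np.argmax(np.log(a) + g, axis=1)
freq = np.bincount(winners, minlength=a.size) / m

print("empirical frequencies:", np.round(freq, 3))
print("a / sum(a):           ", np.round(a / a.sum(), 3))
```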


11.8 (EXP3 AS FOLLOW-THE-PERTURBED-LEADER) Let $(Z_{ti})_{ti}$ be a collection
of independent and identically distributed random variables. The
follow-the-perturbed-leader algorithm chooses

$$A_t = \operatorname{argmax}_{i\in[k]}\left(Z_{ti} - \eta\sum_{s=1}^{t-1}\hat Y_{si}\right).$$

Show that if $Z_{ti}$ is a standard Gumbel, then follow-the-perturbed-leader is the
same as Exp3.


11.9 (EXP3 ON STOCHASTIC BANDITS) In this exercise we compare UCB and
Exp3 on stochastic data. Suppose we have a two-armed stochastic Bernoulli
bandit with $\mu_1 = 0.5$ and $\mu_2 = \mu_1 + \Delta$ with $\Delta = 0.05$.


(a) Plot the regret of UCB and Exp3 on the same plot as a function of the
horizon $n$ using the learning rate from Theorem 11.2.

(b) Now fix the horizon to $n = 10^5$ and plot the regret as a function of the
learning rate. Your plot should look like Fig. 11.5.

(c) Investigate how the shape of this graph changes as you change $\Delta$.

(d) Find empirically the choice of $\eta$ that minimises the worst-case regret over all
reasonable choices of $\Delta$, and compare to the value proposed by the theory.

(e) What can you conclude from all this? Tell an interesting story.


HINT The performance of UCB depends greatly on which version you use. For
best results, remember that Bernoulli distributions are 1/2-subgaussian or use
the KL-UCB algorithm from Chapter 10.


Figure 11.5 Expected regret for Exp3 for different learning rates over $n = 10^5$ rounds
on a Bernoulli bandit with means $\mu_1 = 0.5$ and $\mu_2 = 0.55$.
The Exp3-IX Algorithm


In the last chapter, we proved a sublinear bound on the expected regret of Exp3,
but with a dishearteningly large variance. The objective of this chapter is to
modify Exp3 so that the regret stays small in expectation and is simultaneously
well concentrated about its mean. Such results are called high-probability
bounds. By slightly modifying the algorithm, we show that for each $\delta \in (0,1)$,
there exists an algorithm such that with probability at least $1 - \delta$,

$$\hat R_n = \max_{a\in[k]}\sum_{t=1}^n (y_{tA_t} - y_{ta}) = O\left(\sqrt{nk\log\left(\frac{k}{\delta}\right)}\right).$$

The poor behaviour of Exp3 occurs because the variance of the importance-weighted
estimators can become very large. In this chapter we modify the reward
estimates to control the variance at the price of introducing some bias.


12.1 The Exp3-IX Algorithm


We start by summarising what we know about the behaviour of the random regret
of Exp3. Because we want to use the loss-based estimator, it is more convenient
to switch to losses, which we do for the remainder of the chapter. Rewriting
Eq. (11.15) in terms of losses,

$$\hat L_n - \hat L_{ni} \le \frac{\log(k)}{\eta} + \frac{\eta}{2}\sum_{j=1}^k \hat L_{nj}, \tag{12.1}$$

where $\hat L_n$ and $\hat L_{ni}$ are defined using the loss estimator $\hat Y_{tj}$ by

$$\hat L_n = \sum_{t=1}^n\sum_{j=1}^k P_{tj}\hat Y_{tj} \qquad\text{and}\qquad \hat L_{ni} = \sum_{t=1}^n \hat Y_{ti}.$$

Eq. (12.1) holds no matter how the loss estimators are chosen, provided
they satisfy $0 \le \hat Y_{ti} \le 1/P_{ti}$ for all $t$ and $i$. Of course, the left-hand side of
Eq. (12.1) is not close to the regret unless $\hat Y_{ti}$ is a reasonable estimator of
the loss $y_{ti}$.




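We also need to define the sum of losses observed by the learner and for each
fixed action, which are

$$\tilde L_n = \sum_{t=1}^n y_{tA_t} \qquad\text{and}\qquad L_{ni} = \sum_{t=1}^n y_{ti}.$$

As in the previous chapter, we need to define the (random) regret with respect
to a given arm $i$ as follows:

$$\hat R_{ni} = \sum_{t=1}^n x_{ti} - \sum_{t=1}^n X_t = \tilde L_n - L_{ni}. \tag{12.2}$$

By substituting the above definitions into Eq. (12.1) and rearranging, the regret
with respect to arm $i$ is bounded by

$$\hat R_{ni} = \tilde L_n - L_{ni} = (\hat L_n - \hat L_{ni}) + (\tilde L_n - \hat L_n) + (\hat L_{ni} - L_{ni})
\le \frac{\log(k)}{\eta} + (\tilde L_n - \hat L_n) + (\hat L_{ni} - L_{ni}) + \frac{\eta}{2}\sum_{j=1}^k \hat L_{nj}. \tag{12.3}$$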

This means the random regret can be bounded by controlling $\tilde L_n - \hat L_n$, $\hat L_{ni} - L_{ni}$
and $\hat L_{nj}$ for each $j$. As promised we now modify the loss estimate. Let $\gamma > 0$ be
a small constant to be chosen later and define the biased estimator

$$\hat Y_{ti} = \frac{\mathbb{I}\{A_t = i\}\, Y_t}{P_{ti} + \gamma}. \tag{12.4}$$

First, note that $\hat Y_{ti}$ still satisfies $0 \le \hat Y_{ti} \le 1/P_{ti}$, so (12.3) is still valid. As $\gamma$
increases, the predictable variance decreases, but the bias increases. The optimal
choice of $\gamma$ depends on finding the sweet spot, which we will do once the dust
has settled in the analysis. When Eq. (12.4) is used in the exponential update in
Exp3, the resulting algorithm is called Exp3-IX (Algorithm 10). The suffix ‘IX’
stands for implicit exploration, a name justified by the following argument. A
simple calculation shows that

$$\mathbb{E}_{t-1}[\hat Y_{ti}] = \frac{P_{ti}\, y_{ti}}{P_{ti} + \gamma} = y_{ti} - \frac{\gamma\, y_{ti}}{P_{ti} + \gamma} \le y_{ti}.$$

Since small losses correspond to large rewards, the estimator is optimistically
biased. The effect is a smoothing of $P_t$ so that actions with large losses for which
Exp3 would assign negligible probability are still chosen occasionally. In fact, the
smaller $P_{ti}$ is, the larger the bias is. As a result, Exp3-IX will explore more than
the standard Exp3 algorithm (see Exercise 12.5).


The reason for calling the exploration implicit is that the algorithm
explores more as a consequence of modifying the reward estimates, rather
than directly altering $P_t$.
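The trade-off between bias and variance in the estimator of Eq. (12.4) can be tabulated directly. The sketch below computes the conditional mean and variance of $\hat Y_{ti} = \mathbb{I}\{A_t = i\}\,y_{ti}/(P_{ti} + \gamma)$ for a single arm, using only the fact that the indicator is a Bernoulli($P_{ti}$) variable. The function name and the particular values of the loss and of $\gamma$ are assumptions chosen for illustration.

```python
def ix_moments(p, y, gamma):
    """Conditional mean and variance of Y_hat = I{A = i} y / (p + gamma)
    when arm i is played with probability p and has loss y."""
    value = y / (p + gamma)          # value taken when the arm is played
    mean = p * value                 # equals y - gamma * y / (p + gamma)
    var = p * (1 - p) * value**2     # variance of a scaled Bernoulli(p)
    return mean, var

y, gamma = 0.8, 0.05
for p in [0.5, 0.1, 0.01, 0.001]:
    mean, var = ix_moments(p, y, gamma)
    _, var_unbiased = ix_moments(p, y, 0.0)   # gamma = 0 recovers Exp3
    print(f"p={p:<6} bias {y - mean:.3f}  variance {var:8.3f}  "
          f"variance with gamma=0 {var_unbiased:8.1f}")
```

As the probability of the arm shrinks, the bias of the estimator approaches the full loss while its variance stays bounded, in contrast to the unbiased estimator whose variance grows without limit.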
1: Input: $n$, $k$, $\eta$, $\gamma$

2: Set $\hat L_{0i} = 0$ for all $i$

3: for $t = 1, \ldots, n$ do

4:    Calculate the sampling distribution $P_t$:

      $$P_{ti} = \frac{\exp(-\eta \hat L_{t-1,i})}{\sum_{j=1}^k \exp(-\eta \hat L_{t-1,j})}$$

5:    Sample $A_t \sim P_t$ and observe reward $X_t$

6:    Calculate $\hat L_{ti} = \hat L_{t-1,i} + \dfrac{\mathbb{I}\{A_t = i\}(1 - X_t)}{P_{ti} + \gamma}$

7: end for


Algorithm 10: Exp3-IX.
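Algorithm 10 translates almost line by line into code. The following is a minimal sketch, assuming NumPy and a reward table supplied as an array; the function name `exp3_ix`, the bookkeeping variables and the random reward table are illustrative choices rather than part of the algorithm's specification. Alongside the learner's total loss, the function also accumulates $\sum_t\sum_j P_{tj}\hat Y_{tj}$ so that the quantities appearing in the analysis can be inspected.

```python
import numpy as np

def exp3_ix(x, eta, gamma, rng):
    """Exp3-IX (Algorithm 10) on a reward table x of shape (n, k) in [0, 1].

    Returns the learner's total loss, the sum over rounds of
    sum_j P_tj * Y_hat_tj, and the cumulative loss estimates.
    """
    n, k = x.shape
    l_hat = np.zeros(k)                  # cumulative loss estimates
    total_loss = 0.0                     # sum of the losses actually suffered
    est_loss = 0.0                       # sum_t sum_j P_tj * Y_hat_tj
    for t in range(n):
        # Step 4: exponential weights (shifted by the min for stability).
        w = np.exp(-eta * (l_hat - l_hat.min()))
        p = w / w.sum()
        # Step 5: sample an action and observe its reward.
        a = rng.choice(k, p=p)
        loss = 1.0 - x[t, a]
        # Step 6: biased estimate; only the played arm changes.
        y_hat = loss / (p[a] + gamma)
        l_hat[a] += y_hat
        total_loss += loss
        est_loss += p[a] * y_hat
    return total_loss, est_loss, l_hat

rng = np.random.default_rng(0)
n, k = 5000, 4
x = rng.uniform(size=(n, k))                  # hypothetical reward table
eta = np.sqrt(2 * np.log(k + 1) / (n * k))    # eta_1 from Theorem 12.1
gamma = eta / 2
total_loss, est_loss, l_hat = exp3_ix(x, eta, gamma, rng)
regret = total_loss - (1.0 - x).sum(axis=0).min()
print(f"random regret of this run: {regret:.1f}")
```

The returned quantities satisfy the identity of Lemma 12.4 exactly, up to floating-point error: `total_loss - est_loss` equals `gamma * l_hat.sum()` on every run.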


Regret Analysis 


We now prove the following theorem bounding the random regret of Exp3-IX 
with high probability. 


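THEOREM 12.1. Let $\delta \in (0,1)$ and define

$$\eta_1 = \sqrt{\frac{2\log(k+1)}{nk}} \qquad\text{and}\qquad \eta_2 = \sqrt{\frac{\log(k) + \log\left(\frac{k+1}{\delta}\right)}{nk}}.$$

The following statements hold:


1 If Exp3-IX is run with parameters $\eta = \eta_1$ and $\gamma = \eta/2$, then

$$P\left(\hat R_n \ge \sqrt{8nk\log(k+1)} + \sqrt{\frac{nk}{2\log(k+1)}}\,\log\left(\frac{1}{\delta}\right) + \log\left(\frac{k+1}{\delta}\right)\right) \le \delta. \tag{12.5}$$


2 If Exp3-IX is run with parameters $\eta = \eta_2$ and $\gamma = \eta/2$, then

$$P\left(\hat R_n \ge 2\sqrt{\big(2\log(k+1) + \log(1/\delta)\big)nk} + \log\left(\frac{k+1}{\delta}\right)\right) \le \delta. \tag{12.6}$$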


The value of $\eta_1$ is independent of $\delta$, which means that using this choice of
learning rate leads to a single algorithm with a high-probability bound for
all $\delta$. On the other hand, $\eta_2$ does depend on $\delta$, so the user must choose
a confidence level from the beginning. The advantage is that the bound
is improved, but only for the specified confidence level. We will show in
Chapter 17 that this trade-off is unavoidable.
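To get a feeling for the two regimes, the sketch below evaluates both learning rates and the corresponding regret thresholds for concrete values of $n$, $k$ and $\delta$. The expressions are those obtained by substituting $\gamma = \eta/2$ and the respective learning rate into the final inequality of the proof of Theorem 12.1; the function name and the example values are assumptions made for illustration.

```python
import math

def exp3_ix_bounds(n, k, delta):
    """Learning rates and regret thresholds of the two bounds in Theorem 12.1."""
    log_k1 = math.log(k + 1)
    eta1 = math.sqrt(2 * log_k1 / (n * k))
    eta2 = math.sqrt((math.log(k) + math.log((k + 1) / delta)) / (n * k))
    # Threshold for eta_1, which does not require knowing delta in advance.
    bound1 = (math.sqrt(8 * n * k * log_k1)
              + math.sqrt(n * k / (2 * log_k1)) * math.log(1 / delta)
              + math.log((k + 1) / delta))
    # Threshold for eta_2, which is tuned to the given delta.
    bound2 = (2 * math.sqrt((2 * log_k1 + math.log(1 / delta)) * n * k)
              + math.log((k + 1) / delta))
    return eta1, eta2, bound1, bound2

eta1, eta2, b1, b2 = exp3_ix_bounds(n=10**4, k=10, delta=0.01)
print(f"eta_1 = {eta1:.4f}, threshold {b1:.0f}")
print(f"eta_2 = {eta2:.4f}, threshold {b2:.0f}")
```

For these particular values the threshold obtained with $\eta_2$ is the smaller of the two, which illustrates the gain from fixing the confidence level in advance.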


The proof follows by bounding each of the terms in Eq. (12.3), which we do
via a series of lemmas. The first of these lemmas is a new concentration bound.
To state the lemma, we recall two useful notions: given a filtration
$\mathbb{F} = (\mathcal{F}_t)_{t=0}^n$, a sequence $(Z_t)_{t=1}^n$ is $\mathbb{F}$-adapted if for $t \in [n]$, $Z_t$ is
$\mathcal{F}_t$-measurable, and $(Z_t)_{t=1}^n$ is $\mathbb{F}$-predictable if for $t \in [n]$, $Z_t$ is
$\mathcal{F}_{t-1}$-measurable.


LEMMA 12.2. Let $\mathbb{F} = (\mathcal{F}_t)_{t=0}^n$ be a filtration and for $i \in [k]$ let $(\tilde Y_{ti})_t$ be
$\mathbb{F}$-adapted such that:


1 for any $S \subset [k]$ with $|S| > 1$, $\mathbb{E}\left[\prod_{i\in S}\tilde Y_{ti} \mid \mathcal{F}_{t-1}\right] \le 0$; and
2 $\mathbb{E}[\tilde Y_{ti} \mid \mathcal{F}_{t-1}] = y_{ti}$ for all $t \in [n]$ and $i \in [k]$.


Furthermore, let $(\alpha_{ti})_{ti}$ and $(\lambda_{ti})_{ti}$ be real-valued $\mathbb{F}$-predictable random sequences
such that for all $t, i$ it holds that $0 \le \alpha_{ti}\tilde Y_{ti} \le 2\lambda_{ti}$. Then, for all $\delta \in (0,1)$,

$$P\left(\sum_{t=1}^n\sum_{i=1}^k \alpha_{ti}\left(\frac{\tilde Y_{ti}}{1 + \lambda_{ti}} - y_{ti}\right) \ge \log\left(\frac{1}{\delta}\right)\right) \le \delta.$$

The proof relies on the Cramér-Chernoff method and is deferred until the
end of the chapter. Condition 1 states that the variables $\{\tilde Y_{ti}\}_i$ are negatively
correlated, and it helps us save a factor of $k$. Equipped with this result, we can
easily bound the terms $\hat L_{ni} - L_{ni}$.


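LEMMA 12.3 (Concentration – variance). Let $\delta \in (0,1)$. With probability at least
$1 - \delta$, the following inequalities hold simultaneously:

$$\max_{i\in[k]}(\hat L_{ni} - L_{ni}) \le \frac{\log\left(\frac{k+1}{\delta}\right)}{2\gamma}
\qquad\text{and}\qquad
\sum_{i=1}^k(\hat L_{ni} - L_{ni}) \le \frac{\log\left(\frac{k+1}{\delta}\right)}{2\gamma}. \tag{12.7}$$


Proof Fix $\delta' \in (0,1)$ to be chosen later and let $A_{ti} = \mathbb{I}\{A_t = i\}$ as before. Then

$$\sum_{i=1}^k(\hat L_{ni} - L_{ni}) = \sum_{t=1}^n\sum_{i=1}^k\left(\frac{A_{ti}\, y_{ti}}{P_{ti} + \gamma} - y_{ti}\right)
= \frac{1}{2\gamma}\sum_{t=1}^n\sum_{i=1}^k 2\gamma\left(\frac{1}{1 + \frac{\gamma}{P_{ti}}}\,\frac{A_{ti}\, y_{ti}}{P_{ti}} - y_{ti}\right).$$

Introduce $\lambda_{ti} = \gamma/P_{ti}$, $\tilde Y_{ti} = A_{ti}\, y_{ti}/P_{ti}$ and $\alpha_{ti} = 2\gamma$. Notice that the conditions of
Lemma 12.2 are now satisfied. In particular, for any $S \subset [k]$ with $|S| > 1$, it holds
that $\prod_{i\in S} A_{ti} = 0$ and hence $\prod_{i\in S}\tilde Y_{ti} = 0$. Therefore,

$$P\left(\sum_{i=1}^k(\hat L_{ni} - L_{ni}) \ge \frac{\log(1/\delta')}{2\gamma}\right) \le \delta'. \tag{12.8}$$

Similarly, for any fixed $i$,

$$P\left(\hat L_{ni} - L_{ni} \ge \frac{\log(1/\delta')}{2\gamma}\right) \le \delta'. \tag{12.9}$$

To see this, use the previous argument with $\alpha_{tj} = \mathbb{I}\{j = i\}\, 2\gamma$. The result follows
by choosing $\delta' = \delta/(k+1)$ and the union bound.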

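LEMMA 12.4 (Bias). $\tilde L_n - \hat L_n = \gamma\sum_{j=1}^k \hat L_{nj}$.


Proof Let $A_{ti} = \mathbb{I}\{A_t = i\}$ as before. Writing $Y_t = \sum_{j=1}^k A_{tj}\, y_{tj}$, we calculate

$$Y_t - \sum_{j=1}^k P_{tj}\hat Y_{tj} = \sum_{j=1}^k\left(1 - \frac{P_{tj}}{P_{tj} + \gamma}\right)A_{tj}\, y_{tj}
= \gamma\sum_{j=1}^k\frac{A_{tj}\, y_{tj}}{P_{tj} + \gamma} = \gamma\sum_{j=1}^k \hat Y_{tj}.$$

Therefore $\tilde L_n - \hat L_n = \gamma\sum_{j=1}^k \hat L_{nj}$ as required.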


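Proof of Theorem 12.1 By Eq. (12.3) and Lemma 12.4, we have

$$\hat R_n \le \frac{\log(k)}{\eta} + \max_{i\in[k]}(\hat L_{ni} - L_{ni}) + \left(\frac{\eta}{2} + \gamma\right)\sum_{j=1}^k \hat L_{nj}.$$

Therefore, by Lemma 12.3, with probability at least $1 - \delta$,

$$\hat R_n \le \frac{\log(k)}{\eta} + \frac{\log\left(\frac{k+1}{\delta}\right)}{2\gamma} + \left(\gamma + \frac{\eta}{2}\right)\left(\sum_{j=1}^k L_{nj} + \frac{\log\left(\frac{k+1}{\delta}\right)}{2\gamma}\right)
\le \frac{\log(k)}{\eta} + \left(\gamma + \frac{\eta}{2}\right)nk + \left(\gamma + \frac{\eta}{2} + 1\right)\frac{\log\left(\frac{k+1}{\delta}\right)}{2\gamma},$$

where the second inequality follows since $L_{nj} \le n$ for all $j$. The result follows by
substituting the definitions of $\eta \in \{\eta_1, \eta_2\}$ and $\gamma = \eta/2$.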


The attentive reader may be wondering whether proving the new
concentration inequality of Lemma 12.2, which looks a bit ad hoc, was
really necessary to get the bounds on $\hat L_{ni}$ that were stated in
Lemma 12.3. After all, we had a number of concentration inequalities available
to us that could be applied. As it turns out, one could also use Bernstein’s
inequality to get a result that only loses a factor of two compared to the
specialised lemma. The details are in Exercise 12.1. There are two important
lessons that are the basis of both proofs. The first is that since $\mathbb{E}[\hat L_{ni}] < L_{ni}$
and the gap between these two quantities is large enough in a manner that
we make precise in Exercise 12.1, the deviation $\hat L_{ni} - L_{ni}$ can be bounded
independently of $P_{ti}$, $A_{ti}$ and $y_{ti}$. The price is that instead of $\sqrt{\log(1/\delta)}$, the
bound scales linearly with the generally larger quantity $\log(1/\delta)$. The factor
$1/\gamma$ here is the maximum scale of the individual summands in $\hat L_{ni}$. The
second lesson is specific to how in bounding $\sum_i(\hat L_{ni} - L_{ni})$ a union bound over
$i$ is avoided: this works because for a fixed time index $t$, $(A_{ti})_i$ are negatively
correlated. Negative dependence/association/correlation are known to be
good substitutes for independence, and by exploiting such properties one
can often demonstrate better concentration.


12.2.1 Proof of Lemma 12.2


We start with a technical inequality:


LEMMA 12.5. For any $0 \le x \le 2\lambda$ it holds that $\exp\left(\frac{x}{1+\lambda}\right) \le 1 + x$.


Note that $1 + x \le \exp(x)$. What the lemma shows is that by slightly discounting
the argument of the exponential function, in a bounded neighbourhood of zero,
$1 + x$ can be an upper bound for the resulting function. Or, equivalently, slightly
inflating the linear term in $1 + x$, the linear lower bound becomes an upper bound.


Proof of Lemma 12.5 We have

$$\exp\left(\frac{x}{1+\lambda}\right) \le \exp\left(\frac{x}{1 + x/2}\right) \le 1 + x,$$

where the first inequality is because $\lambda \mapsto \exp\left(\frac{x}{1+\lambda}\right)$ is decreasing in $\lambda$, and the
second is because $\frac{2u}{1+u} \le \log(1 + 2u)$ holds for all $u \ge 0$. This latter inequality
can be seen to hold by noting that for $u = 0$, the two sides are equal, while the
derivative of the left-hand side is smaller than that of the right-hand side at any
$u > 0$.
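A quick numerical sanity check of Lemma 12.5 is easy to run. The sketch below evaluates $\exp(x/(1+\lambda)) - (1 + x)$ on a grid of $x \in [0, 2\lambda]$ for a range of values of $\lambda$ and records the largest value found, which the lemma says cannot be positive. The grid sizes and the range of $\lambda$ are arbitrary choices, and a grid check is only an illustration, not a proof.

```python
import numpy as np

worst = -np.inf
for lam in np.logspace(-2, 1, 50):           # lambda from 0.01 to 10
    xs = np.linspace(0.0, 2 * lam, 201)      # grid on [0, 2 * lambda]
    gap = np.exp(xs / (1 + lam)) - (1 + xs)  # should never be positive
    worst = max(worst, gap.max())

print("largest value of exp(x / (1 + lambda)) - (1 + x):", worst)
```

The two sides coincide at $x = 0$, so the largest gap on the grid is attained there.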


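Proof of Lemma 12.2 Fix $t \in [n]$ and let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ denote the conditional
expectation with respect to $\mathcal{F}_t$. By Lemma 12.5 and the assumption that
$0 \le \alpha_{ti}\tilde Y_{ti} \le 2\lambda_{ti}$, we have

$$\exp\left(\frac{\alpha_{ti}\tilde Y_{ti}}{1 + \lambda_{ti}}\right) \le 1 + \alpha_{ti}\tilde Y_{ti}.$$

Taking the product of these inequalities over $i$,

$$\mathbb{E}_{t-1}\left[\exp\left(\sum_{i=1}^k\frac{\alpha_{ti}\tilde Y_{ti}}{1 + \lambda_{ti}}\right)\right]
\le \mathbb{E}_{t-1}\left[\prod_{i=1}^k(1 + \alpha_{ti}\tilde Y_{ti})\right]
\le 1 + \mathbb{E}_{t-1}\left[\sum_{i=1}^k\alpha_{ti}\tilde Y_{ti}\right]
= 1 + \sum_{i=1}^k\alpha_{ti}\, y_{ti}
\le \exp\left(\sum_{i=1}^k\alpha_{ti}\, y_{ti}\right), \tag{12.10}$$

where the second inequality follows from
$\prod_{i=1}^k(a_i + b_i) = \sum_{c\in\{0,1\}^k}\prod_{i=1}^k a_i^{c_i} b_i^{1-c_i}$ and
the assumption that for $S \subset [k]$ with $|S| > 1$, $\mathbb{E}_{t-1}\left[\prod_{i\in S}\tilde Y_{ti}\right] \le 0$, the third step
follows from the assumption that $\mathbb{E}_{t-1}[\tilde Y_{ti}] = y_{ti}$, while the last one follows from
$1 + x \le \exp(x)$. Define

$$Z_t = \exp\left(\sum_{i=1}^k\alpha_{ti}\left(\frac{\tilde Y_{ti}}{1 + \lambda_{ti}} - y_{ti}\right)\right)$$

and let $M_t = Z_1\cdots Z_t$, $t \in [n]$, with $M_0 = 1$. By (12.10), $\mathbb{E}_{t-1}[Z_t] \le 1$. Therefore

$$\mathbb{E}[M_t] = \mathbb{E}[\mathbb{E}_{t-1}[M_t]] = \mathbb{E}[M_{t-1}\mathbb{E}_{t-1}[Z_t]] \le \mathbb{E}[M_{t-1}] \le \cdots \le \mathbb{E}[M_0] = 1.$$

Setting $t = n$ and combining the above display with Markov’s inequality leads to
$P(\log(M_n) \ge \log(1/\delta)) = P(M_n\delta \ge 1) \le \mathbb{E}[M_n]\,\delta \le \delta$.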
Notes


1 An alternative to the somewhat custom-made Lemma 12.2 is to use a Bernstein- 
type bound that simply bounds the deviation of a martingale from its mean 
in terms of its quadratic variation. The slight disadvantage of this is that this 
way we lose a factor of two. If this is not a concern, one may even prefer this 
approach due to its greater transparency. For details, see Exercise 12.1. 

2 An upper bound on the expected regret of Exp3-IX can be obtained by
integrating the tail:

$$R_n \le \mathbb{E}[(\hat R_n)^+] = \int_0^\infty P\big((\hat R_n)^+ \ge x\big)\,dx \le \int_0^\infty P(\hat R_n \ge x)\,dx,$$

where the first equality follows from Proposition 2.8. The result is completed
using either of the high-probability bounds in Theorem 12.1 and straightforward
integration. We leave the details to the reader in Exercise 12.7.

3 The analysis presented here uses a fixed learning rate that depends on the
horizon. Replacing $\eta$ and $\gamma$ with $\eta_t = \sqrt{\log(k)/(kt)}$ and $\gamma_t = \eta_t/2$ leads to an
anytime algorithm with about the same regret [Kocák et al., 2014, Neu, 2015a].

4 There is another advantage of the modified importance-weighted estimators 
used by Exp3-IX, which leads to an improved regret in the special case that 
one of the arms has small losses. Specifically, it is possible to show that 


R, =O ( k min Ly; at) ; 
i€ [k] 


In the worst case, $L_{ni}$ is linear in $n$ and the usual bound is recovered. But
if the optimal arm enjoys low cumulative loss, then the above can be a
big improvement over the bounds given in Theorem 12.1. Bounds of this
kind are called first-order bounds. We refer the interested reader to the 
papers by Allenberg et al. [2006], Abernethy et al. [2012] and Neu [2015b] and 
Exercise 28.14. 

Another situation where one might hope to have a smaller regret is when the
rewards/losses for each arm do not deviate too far from their averages. Define
the quadratic variation by

$$Q_n = \sum_{t=1}^n \|x_t - \mu\|^2, \qquad\text{where } \mu = \frac{1}{n}\sum_{t=1}^n x_t.$$

Hazan and Kale [2011] gave an algorithm for which $R_n = O(k^2\sqrt{Q_n})$, which can
be better than the worst-case bound of Exp3 or Exp3-IX when the quadratic
variation is very small. The factor of $k^2$ is suboptimal and can be removed
using a careful instantiation of the mirror descent algorithm [Bubeck et al.,
2018]. We do not cover this exact algorithm in this book, but the techniques
based on mirror descent are presented in Chapter 28.

An alternative to the algorithm presented here is to mix the probability
distribution computed using exponential weights with the uniform distribution,
while biasing the estimates. This leads to the Exp3.P algorithm due to Auer
et al. [2002b], who considered the case where $\delta$ is given and derived a bound that
is similar to Eq. (12.6) of Theorem 12.1. With an appropriate modification of
their proof, it is possible to derive a weaker bound similar to Eq. (12.5), where
the knowledge of $\delta$ is not needed by the algorithm. This has been explored by
Beygelzimer et al. [2010] in the context of a related algorithm, which will be
considered in Chapter 18. One advantage of this approach is that it generalises
to the case where the loss estimators are sometimes negative, a situation that
can arise in more complicated settings. For technical details, we advise the
reader to work through Exercise 12.3.


Bibliographic Remarks 


The Exp3-IX algorithm is due to Kocák et al. [2014], who also introduced the
biased loss estimators. The focus of that paper was to improve algorithms for 
more complex models with potentially large action sets and side information, 
though their analysis can still be applied to the model studied in this chapter. The 
observation that this algorithm also leads to high-probability bounds appeared in 
a follow-up paper by Neu [2015a]. High-probability bounds for adversarial bandits 
were first provided by Auer et al. [2002b] and explored in a more generic way by 
Abernethy and Rakhlin [2009]. The idea to reduce the variance of importance- 
weighted estimators is not new and seems to have been applied in various forms 
[Uchibe and Doya, 2004, Wawrzynski and Pacut, 2007, Ionides, 2008, Bottou 
et al., 2013]. All of these papers are based on truncating the estimators, which 
makes the resulting estimator less smooth. Surprisingly, the variance-reduction 
technique used in this chapter seems to be recent [Kocák et al., 2014]. 


Exercises 


12.1 (BERNSTEIN-TYPE INEQUALITY AND LEMMA 12.3) Using the Bernstein-type
inequality stated in Exercise 5.15, show the following:


(a) For any $\delta \in (0,1)$, with probability at least $1 - \delta$, $\hat L_{ni} - L_{ni} \le \frac{1}{\gamma}\log(1/\delta)$.

(b) For any $\delta \in (0,1)$, with probability at least $1 - \delta$,
$\sum_{i=1}^k(\hat L_{ni} - L_{ni}) \le \frac{1}{\gamma}\log(1/\delta)$.

12.2 Prove the claims made in Note 3. 

Hint The source for this exercise is theorem 1 of the paper by Neu [2015a]. 


You can also read ahead and use the techniques from Exercise 28.13. 


12.3 (Exp3.P) In this exercise we ask you to analyse the Exp3.P algorithm,
which as we mentioned in the notes is another way to obtain high-probability
bounds. The idea is to modify Exp3 by biasing the estimators and introducing
some forced exploration. Let $\hat Y_{ti} = A_{ti}\, y_{ti}/P_{ti} - \beta/P_{ti}$ be a biased version of the
loss-based importance-weighted estimator that was used in the previous chapter.
Define $\hat L_{ti} = \sum_{s=1}^t \hat Y_{si}$ and consider the policy that samples $A_t \sim P_t$, where

$$P_{ti} = (1 - \gamma)\tilde P_{ti} + \frac{\gamma}{k} \qquad\text{with}\qquad \tilde P_{ti} = \frac{\exp(-\eta\hat L_{t-1,i})}{\sum_{j=1}^k\exp(-\eta\hat L_{t-1,j})}.$$

(a) Let $\delta \in (0,1)$ and $i \in [k]$. Show that with probability $1 - \delta$, the random
regret $\hat R_{ni}$ against $i$ (cf. (12.2)) satisfies


x ae yd 7 B nlog(1/8) 
Rng < ry + O= Tuut > a ae 
(b) Show that 


n k n k 
D Pyu- yer) =) Y Palu- Yes) t+) Ou- yu). 


t=1 j=1 t=1 j=1 t=1 


(c) Show that 


t=1 j=1 t=1 j=1 
(d) Show that 
n k n 
a B 1 
Pave < 
Xd T Da 
=1j=1 1 


(e) Suppose that $\gamma = k\eta$ and $\eta = \beta$. Apply the result of Exercise 5.15 to show
that for any $\delta \in (0,1)$, the following hold:


P 1 k 1 
p > 2nk + —log{=})] <6. 


t=1 


P SON iia i <ô. 
= B AS 


(f) Combining the previous steps, show that there exists a universal constant 
$C > 0$ such that for any $\delta \in (0,1)$, for an appropriate choice of $\eta$, $\gamma$ and $\beta$, 
with probability at least $1 - \delta$ it holds that the random regret $\hat R_n$ of Exp3.P 
satisfies 


$$\hat R_n \le C \sqrt{nk \log(k/\delta)}.$$


(g) In which step did you use the modified estimators? 

(h) Show a bound where the algorithm parameters $\eta$, $\gamma$, $\beta$ can only depend on 
$n$, $k$, but not on $\delta$. 

(i) Compare the bounds with the analogous bounds for Exp3-IX in Theorem 12.1. 
12.4 (GENERIC EXP3.P ANALYSIS) This exercise is concerned with a 
generalisation of the core idea underlying Exp3.P of the previous exercise in that 
rather than giving explicit expressions for the biased loss estimates, we focus 
on the key properties required by the analysis of Exp3.P. To reduce clutter, we 
assume for the remainder that $t$ ranges in $[n]$ and $a \in [k]$. Let 
$(\Omega, \mathcal F, \mathbb F, \mathbb P)$ be a filtered probability space with 
$\mathbb F = (\mathcal F_t)_{t \ge 0}$. Let $(Z_t)$, $(\hat Z_t)$, $(\tilde Z_t)$, $(\beta_t)$ be sequences 
of random elements in $\mathbb R^k$, where $\tilde Z_t = \hat Z_t - \beta_t$ and $(Z_t)$, $(\beta_t)$ are $\mathbb F$-predictable, 
whereas $(\hat Z_t)$ and therefore also $(\tilde Z_t)$ are $\mathbb F$-adapted. You should think of $\hat Z_t$ as 
the estimate of $Z_t$ that uses randomisation, and $\beta_t$ is the bias as in the previous 
exercise. Given a positive constant $\eta$, define the probability vector $P_t \in \mathcal P_{k-1}$ by 


$$P_{ta} = \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \tilde Z_{sa}\right)}{\sum_{b=1}^k \exp\left(-\eta \sum_{s=1}^{t-1} \tilde Z_{sb}\right)}.$$
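Let $\mathbb E_{t-1}[\cdot] = \mathbb E[\cdot \mid \mathcal F_{t-1}]$. Assume the following hold for all $a \in [k]$: 
(a) $\eta |\hat Z_{ta}| \le 1$, (b) $\eta \beta_{ta} \le 1$, 
(c) $\eta \mathbb E_{t-1}[\hat Z_{ta}^2] \le \beta_{ta}$ almost surely, (d) $\mathbb E_{t-1}[\hat Z_{ta}] = Z_{ta}$ almost surely. 
Let $A^* = \operatorname{argmin}_{a \in [k]} \sum_{t=1}^n Z_{ta}$ and 
$R_n = \sum_{t=1}^n \sum_{a=1}^k P_{ta}(Z_{ta} - Z_{tA^*})$. 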
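(a) Show that 
$$R_n = \underbrace{\sum_{t=1}^n \sum_{a=1}^k P_{ta}(\tilde Z_{ta} - \tilde Z_{tA^*})}_{(A)} + \underbrace{\sum_{t=1}^n \sum_{a=1}^k P_{ta}(Z_{ta} - \tilde Z_{ta})}_{(B)} + \underbrace{\sum_{t=1}^n (\tilde Z_{tA^*} - Z_{tA^*})}_{(C)}.$$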


(b) Show that 


$$(A) \le \frac{\log(k)}{\eta} + \eta \sum_{t=1}^n \sum_{a=1}^k P_{ta} \hat Z_{ta}^2 + 3 \sum_{t=1}^n \sum_{a=1}^k P_{ta} \beta_{ta}.$$
(c) Show that with probability at least $1 - \delta$, 
$$(B) \le 2 \sum_{t=1}^n \sum_{a=1}^k P_{ta} \beta_{ta} + \frac{\log(1/\delta)}{\eta}.$$


(d) Show that with probability at least $1 - k\delta$, 


$$(C) \le \frac{\log(1/\delta)}{\eta}.$$
(e) Conclude that for any $\delta \le 1/(k+1)$, with probability at least $1 - (k+1)\delta$, 


$$R_n \le \frac{3 \log(1/\delta)}{\eta} + \eta \sum_{t=1}^n \sum_{a=1}^k P_{ta} \hat Z_{ta}^2 + 5 \sum_{t=1}^n \sum_{a=1}^k P_{ta} \beta_{ta}.$$


HINT This is a long and challenging exercise. You may find it helpful to use 
the result in Exercise 5.15. The solution is also available. 


12.5 (IMPLEMENTATION) Consider the Bernoulli bandit with $k = 5$ arms and 
$n = 10^4$ with means $\mu_1 = 1/2$ and $\mu_i = 1/2 - \Delta$ for $i > 1$. Plot the regret of Exp3 
and Exp3-IX for $\Delta \in [0, 1/2]$. You should get a plot similar to that of Fig. 12.1. 
Does the result surprise you? 


Figure 12.1 Comparison between Exp3 and Exp3-IX on Bernoulli bandit (vertical axis: regret) 


12.6 (IMPLEMENTATION: VARIANCE OF Exp3-IX) Repeat the experiment that 
led to Fig. 11.4 but with Exp3 swapped for Exp3-IX. Use the values of $\eta$ and 
$\gamma$ from Theorem 12.1 that do not depend on the confidence parameter. You 
should get a figure similar to Fig. 12.2. Compare the new and the old figures and 
summarise your findings, including the outcome of Exercise 12.5. 


12.7 (EXPECTED REGRET OF Exp3-IX) In this exercise, you will complete the 
steps explained in Note 2 to prove a bound on the expected regret of Exp3-IX. 


(a) Find a choice of $\eta$ and universal constant $C > 0$ such that 


$$R_n \le C \sqrt{kn \log(k)}.$$


(b) What happens as $\eta$ grows? Write a bound on the expected regret of Exp3-IX 
in terms of $\eta$ and $k$ and $n$. 
Figure 12.2 Box and whisker plot of the regret of Exp3-IX for the same settings as those 
used to produce Fig. 11.4. For details of the experimental settings, see the text of 
Exercise 11.6. 


Part IV 


Lower Bounds for Bandits 
with Finitely Many Arms 
Until now, we have indulged ourselves by presenting algorithms and upper 
bounds on their regret. As satisfying as this is, the real truth of a problem is 
usually to be found in the lower bounds. There are several reasons for this: 


1 An upper bound does not tell you much about what you could be missing 
out on. The only way to demonstrate that your algorithm really is (close to) 
optimal is to prove a lower bound showing that no algorithm can do better. 

2 The second reason is that lower bounds are often more informative in the 
sense that it usually turns out to be easier to get the lower bound right than 
the upper bound. History shows a list of algorithms with steadily improving 
guarantees until eventually someone hits upon the idea for which the upper 
bound matches some known lower bound. 

3 Finally, thinking about lower bounds forces you to understand what is hard 
about the problem. This is so useful that the best place to start when attacking 
a new problem is usually to try and prove lower bounds. Too often we have 
not heeded our own advice and started trying to design an algorithm, only 
to discover later that had we tackled the lower bound first, then the right 
algorithm would have fallen in our laps with almost no effort at all. 


So what is the form of a typical lower bound? In the chapters that follow, we 
will see roughly two flavours. The first is the worst-case lower bound, which 
corresponds to a claim of the form 


‘For any policy you give me, I will give you an instance of a bandit problem $\nu$ on which 
the regret is at least L’. 


Results of this kind have an adversarial flavour, which makes them suitable for 
understanding the robustness of a policy. The second type is a lower bound on 
the regret of an algorithm for specific instances. These bounds have a different 
form that usually reads like the following: 


‘If you give me a reasonable policy, then its regret on any instance $\nu$ is at least $L(\nu)$’. 


The statement only holds for some policies: the ‘reasonable’ ones, whatever that 
means. But the guarantee is also more refined because the bound controls the regret 
for these policies on every instance by a function that depends on this instance. 
This kind of bound will allow us to show that the instance-dependent bounds 
for stochastic bandits of $O(\sum_{i : \Delta_i > 0} (\Delta_i + \log(n)/\Delta_i))$ are not improvable. The 
inclusion of the word ‘reasonable’ is unfortunately necessary. For every bandit 
instance $\nu$ there is a policy that just chooses the optimal action in $\nu$. Such policies 
are not reasonable because they have linear regret for bandits with a different 
optimal arm. There are a number of ways to define ‘reasonable’ in a way that is 
simultaneously rigorous and, well, reasonable. 

The contents of this part are roughly as follows. First we introduce the definition 
of worst-case regret and discuss the line of attack for proving lower bounds 
(Chapter 13). The next chapter takes us on a brief excursion into information 
theory, where we explain the necessary mathematical tools (Chapter 14). Readers 
familiar with information theory could skim this chapter. The final three chapters 
are devoted to applying information theory to prove lower bounds on the regret 
for both stochastic and adversarial bandits. 


13 


Lower Bounds: Basic Ideas 


The worst-case regret of a policy $\pi$ on a set of stochastic bandit environments 
$\mathcal E$ is 


$$R_n(\pi, \mathcal E) = \sup_{\nu \in \mathcal E} R_n(\pi, \nu).$$


Let $\Pi$ be the set of all policies. The minimax regret is 


$$R_n^*(\mathcal E) = \inf_{\pi \in \Pi} R_n(\pi, \mathcal E) = \inf_{\pi \in \Pi} \sup_{\nu \in \mathcal E} R_n(\pi, \nu).$$


A policy is called minimax optimal for $\mathcal E$ if $R_n(\pi, \mathcal E) = R_n^*(\mathcal E)$. The value $R_n^*(\mathcal E)$ 
is of interest by itself. A small value of $R_n^*(\mathcal E)$ indicates that the underlying bandit 
problem is less challenging in the worst-case sense. A core activity in bandit 
theory is to understand what makes $R_n^*(\mathcal E)$ large or small, often focusing on its 
behaviour as a function of the number of rounds $n$. 
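
The two definitions can be made concrete on a toy example. The sketch below is only an illustration: it assumes a finite set of policies and a finite set of environments whose regrets are already known, so the supremum and infimum become a `max` and a `min`. The table `regret_table` and the policy names are hypothetical, and the numbers are made up.

```python
def worst_case_regret(regret_row):
    """Worst-case regret of one policy: the largest regret over all environments."""
    return max(regret_row)


def minimax_policy(regret_table):
    """Return (policy name, minimax value) for a finite table: policy -> regrets."""
    best = min(regret_table, key=lambda name: worst_case_regret(regret_table[name]))
    return best, worst_case_regret(regret_table[best])


# Hypothetical regrets of three policies on three environments (made-up numbers).
regret_table = {
    "policy_a": [10.0, 40.0, 15.0],
    "policy_b": [25.0, 25.0, 25.0],
    "policy_c": [5.0, 60.0, 5.0],
}

best_policy, minimax_value = minimax_policy(regret_table)
print(best_policy, minimax_value)
```

In this toy table the policy with the flat regret profile attains the minimax value even though the other two policies do better on some environments.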


Minimax optimality is not a property of a policy alone. It is a property of a 
policy together with a set of environments and a horizon. 


Finding a minimax policy is generally too computationally expensive to be 
practical. For this reason, we almost always settle for a policy that is nearly 
minimax optimal. 

One of the main results of this part is a proof of the following theorem, which 
together with Theorem 9.1 shows that Algorithm 7 from Chapter 9 is minimax 
optimal up to constant factors for 1-subgaussian bandits with suboptimality gaps 
in [0,1]. 


THEOREM 13.1. Let $\mathcal E^k$ be the set of $k$-armed Gaussian bandits with unit variance 
and means $\mu \in [0,1]^k$. Then there exists a universal constant $c > 0$ such that for 
all $k > 1$ and $n \ge k$, it holds that $R_n^*(\mathcal E^k) \ge c\sqrt{kn}$. 


We will prove this theorem in Chapter 15, but first we give an informal 
justification. 


13.1 
Main Ideas Underlying Minimax Lower Bounds 


Let $X_1, \ldots, X_n$ be a sequence of independent Gaussian random variables with 
unknown mean $\mu$ and known variance 1. Assume you are told that $\mu$ takes on 
one of two values: $\mu = 0$ or $\mu = \Delta$ for some known $\Delta > 0$. Your task is to guess 
the value of $\mu$ based on your observation of $X_1, \ldots, X_n$. Let 
$\hat\mu = \frac{1}{n} \sum_{i=1}^n X_i$ be the sample mean, which is Gaussian with mean $\mu$ and 
variance $1/n$. While it is not immediately obvious how easy this task is, 
intuitively we expect the optimal decision is to predict that $\mu = 0$ if $\hat\mu$ is 
closer to 0 than to $\Delta$, and otherwise to predict $\mu = \Delta$. For large $n$ we expect 
our prediction will probably be correct. Supposing that $\mu = 0$ (the other case is 
symmetric), then the prediction will be wrong only if $\hat\mu \ge \Delta/2$. Using the fact 
that $\hat\mu$ is Gaussian with mean $\mu = 0$ and variance $1/n$, combined with known 
bounds on the Gaussian tail probabilities (see Eq. (13.4)), leads to 


$$\frac{1}{\sqrt{n\Delta^2} + \sqrt{n\Delta^2 + 16}} \sqrt{\frac{8}{\pi}} \exp\left(-\frac{n\Delta^2}{8}\right) \le \mathbb P\left(\hat\mu \ge \frac{\Delta}{2}\right) \le \frac{1}{\sqrt{n\Delta^2} + \sqrt{n\Delta^2 + 32/\pi}} \sqrt{\frac{8}{\pi}} \exp\left(-\frac{n\Delta^2}{8}\right). \tag{13.1}$$


The upper and lower bounds only differ in the constant in the square root of the 
denominator. One might believe that the decision procedure could be improved, 
but the symmetry of the problem makes this seem improbable. The formula 
exhibits the expected behaviour, which is that once $n$ is large relative to $8/\Delta^2$, 
then the probability that this procedure fails drops exponentially with further 
increases in $n$. But the lower bound also shows that if $n$ is small relative to $8/\Delta^2$, 
then the procedure fails with constant probability. 
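
A quick numerical check may help to see how tight the two bounds in Eq. (13.1) are. The sketch below assumes the bounds take the standard Gaussian-tail form with constants 16 and $32/\pi$ under the square roots, and it computes the exact failure probability with the complementary error function. The function names are hypothetical.

```python
import math


def gaussian_test_error(n, delta):
    """Exact P(mu_hat >= delta / 2) when mu_hat is Gaussian with mean 0, variance 1/n."""
    x = math.sqrt(n) * delta / 2.0  # threshold in units of standard deviations
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_bounds(n, delta):
    """Lower and upper bounds on the same probability, in the form of Eq. (13.1)."""
    s = n * delta ** 2
    common = math.sqrt(8.0 / math.pi) * math.exp(-s / 8.0)
    lower = common / (math.sqrt(s) + math.sqrt(s + 16.0))
    upper = common / (math.sqrt(s) + math.sqrt(s + 32.0 / math.pi))
    return lower, upper


for n in (1, 10, 100, 1000):
    lo, hi = tail_bounds(n, 0.3)
    print(n, lo, gaussian_test_error(n, 0.3), hi)
```

With $\Delta = 0.3$ the threshold $8/\Delta^2$ is roughly 89, so the printed probabilities stay of constant order for the small values of $n$ and collapse for $n = 1000$.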

The problem described is called hypothesis testing, and the ideas underlying 
the argument above are core to many impossibility results in statistics. The next 
task is to reduce our bandit problem to hypothesis testing. The high-level idea 
is to select two bandit problem instances in such a way that the following two 
conditions hold simultaneously: 


1 Competition: An action, or, more generally, a sequence of actions that is good 
for one bandit is not good for the other. 

2 Similarity: The instances are ‘close’ enough that the policy interacting with 
either of the two instances cannot statistically identify the true bandit with 
reasonable statistical accuracy. 


The two requirements are clearly conflicting. The first makes us want to choose 
instances with means $\mu, \mu' \in [0,1]^k$ that are far from each other, while the second 
requirement makes us want to choose them to be close to each other. The lower 
bound will follow by optimising this trade-off. 

Let us start to make things concrete by choosing bandits $\nu = (P_i)_{i=1}^k$ and 
$\nu' = (P_i')_{i=1}^k$, where $P_i = \mathcal N(\mu_i, 1)$ and $P_i' = \mathcal N(\mu_i', 1)$ are Gaussian and 
$\mu, \mu' \in [0,1]^k$. We will also assume that $n$ is larger than $k$ by some suitably 
large constant factor. In order to prove a lower bound, it suffices to show that for 
every strategy $\pi$, there exists a choice of $\mu$ and $\mu'$ such that 


$$\max\{R_n(\pi, \nu), R_n(\pi, \nu')\} \ge c\sqrt{kn},$$


where $c > 0$ is a universal constant. Let $\Delta \in (0, 1/2]$ be a constant to be tuned 
subsequently and choose $\mu = (\Delta, 0, 0, \ldots, 0)$, which means that the first arm is 
optimal in instance $\nu$ and 


$$R_n(\pi, \nu) = (n - \mathbb E[T_1(n)])\Delta, \tag{13.2}$$


where the expectation is taken with respect to the induced measure on the 
sequence of outcomes when $\pi$ interacts with $\nu$. Now we need to choose $\mu'$ to 
satisfy the two requirements above. Since we want $\nu$ and $\nu'$ to be hard to 
distinguish and yet have different optimal actions, we should make $\mu'$ as close to 
$\mu$ as possible except in a coordinate where $\pi$ expects to explore the least. To this 
end, let 


$$i = \operatorname{argmin}_{j > 1} \mathbb E[T_j(n)]$$


be the suboptimal arm in $\nu$ that $\pi$ expects to play least often. From 
$n = \mathbb E[T_1(n)] + \sum_{j > 1} \mathbb E[T_j(n)] \ge (k-1)\mathbb E[T_i(n)]$ we see that 


$$\mathbb E[T_i(n)] \le \frac{n}{k-1}$$


must hold. Then, define $\mu' \in \mathbb R^k$ by 
$$\mu_j' = \begin{cases} \mu_j, & \text{if } j \ne i; \\ 2\Delta, & \text{otherwise.} \end{cases}$$


The regret in this bandit is 


$$R_n(\pi, \nu') = \Delta \mathbb E'[T_1(n)] + \sum_{j \notin \{1, i\}} 2\Delta \mathbb E'[T_j(n)] \ge \Delta \mathbb E'[T_1(n)], \tag{13.3}$$


where $\mathbb E'[\cdot]$ is the expectation operator on the sequence of outcomes when $\pi$ 
interacts with $\nu'$. So now we have the following situation: the strategy $\pi$ interacts 
with either $\nu$ or $\nu'$, and when interacting with $\nu$, it expects to play arm $i$ at 
most $n/(k-1)$ times. But the two instances only differ when playing arm $i$. The 
time has come to tune $\Delta$. Because the strategy expects to play arm $i$ only about 
$n/(k-1)$ times, taking inspiration from the previous discussion on distinguishing 
samples from Gaussian distributions with different means, we will choose 


$$\Delta = \sqrt{\frac{k-1}{n}}.$$


If we are prepared to ignore the fact that $T_i(n)$ is a random variable and take for 
granted the claims in the first part of the chapter, then with this choice of $\Delta$, 
the strategy cannot distinguish between instances $\nu$ and $\nu'$, and in particular we 
expect that $\mathbb E[T_1(n)] \approx \mathbb E'[T_1(n)]$. If $\mathbb E[T_1(n)] < n/2$, then by Eq. (13.2) we have 


$$R_n(\pi, \nu) \ge \frac{n}{2} \sqrt{\frac{k-1}{n}} = \frac{1}{2} \sqrt{n(k-1)}.$$

On the other hand, if $\mathbb E[T_1(n)] \ge n/2$, then 


$$R_n(\pi, \nu') \ge \Delta \mathbb E'[T_1(n)] \approx \Delta \mathbb E[T_1(n)] \ge \frac{1}{2} \sqrt{n(k-1)},$$


which completes our heuristic argument that there exists a universal constant 
$c > 0$ such that 


$$R_n^*(\mathcal E^k) \ge c\sqrt{nk}.$$


We have been sloppy in many places. The claims in the first part of the chapter 
have not been proven yet, and $T_i(n)$ is a random variable. Before we can present the 
rigorous argument, we need a chapter to introduce some ideas from information 
theory. Readers already familiar with these concepts can skip to Chapter 15 for 
the proof of Theorem 13.1. 
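
The construction of the two instances can be sketched in a few lines. This is an illustration under stated assumptions rather than part of the proof: the expected play counts under $\nu$ are taken as given (in reality they depend on the policy), arms are indexed from 0 so that arm 0 is the optimal arm of $\nu$, and the gap is taken to be $\Delta = \sqrt{(k-1)/n}$, which is the choice consistent with the bounds displayed above. The function name `build_instances` is hypothetical.

```python
import math


def build_instances(expected_plays, n):
    """Build the mean vectors of nu and nu' from the expected play counts under nu.

    expected_plays[j] stands for E[T_j(n)] under nu; arm 0 is optimal in nu.
    Returns (mu, mu_prime, i, gap), where i is the least-played suboptimal arm.
    """
    k = len(expected_plays)
    gap = math.sqrt((k - 1) / n)  # assumed choice of Delta; needs n >= 4(k-1) for gap <= 1/2
    # Suboptimal arm that the policy expects to play least often under nu.
    i = min(range(1, k), key=lambda j: expected_plays[j])
    mu = [gap] + [0.0] * (k - 1)
    mu_prime = list(mu)
    mu_prime[i] = 2.0 * gap  # arm i becomes the unique optimal arm in nu'
    return mu, mu_prime, i, gap


# Hypothetical expected play counts of some policy over n = 1000 rounds.
plays = [600.0, 150.0, 50.0, 200.0]
mu, mu_prime, i, gap = build_instances(plays, 1000)
print(i, gap, mu, mu_prime)
```

The two mean vectors differ only in coordinate `i`, and by the pigeonhole argument that arm is played at most $n/(k-1)$ times in expectation under $\nu$.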


Notes 


1 The worst-case regret has a game-theoretic interpretation. Imagine a game 
between a protagonist and an antagonist that works as follows: for $k > 1$ and 
$n \ge k$ the protagonist proposes a bandit policy $\pi$. The antagonist looks at the 
policy and chooses a bandit $\nu$ from the class of environments considered. The 
utility for the antagonist is the expected regret, and for the protagonist it is the 
negation of the expected regret, which makes this a zero-sum game. Both players 
aim to maximise their pay-offs. The game is completely described by $n$ and $\mathcal E$. 
One characteristic value in a game is its minimax value. As described above, 
this is a sequential game (the protagonist moves first, then the antagonist). The 
minimax value of this game from the perspective of the antagonist is exactly 
$R_n^*(\mathcal E)$, while for the protagonist, it is $\sup_\pi \inf_\nu (-R_n(\pi, \nu)) = -R_n^*(\mathcal E)$. 

2 We mentioned that finding the minimax optimal policy is usually 
computationally infeasible. In fact it is not clear we should even try. In classical 
statistics, it often turns out that minimising the worst case leads to a flat 
risk profile. In the language of bandits, this would mean that the regret is 
the same for every bandit (where possible). What we usually want in practice 
is to have low regret against ‘easy’ bandits and larger regret against ‘hard’ 
bandits. The analysis in Part II suggests that easy bandits are those where the 
suboptimality gaps are large or very small. There is evidence to suggest that 
the exact minimax optimal strategy may not exploit these easy instances, so 
in practice one might prefer to find a policy that is nearly minimax optimal 
and has much smaller regret on easy bandits. We will tackle questions of this 
nature in Chapter 16. 
3 The regret on a class of bandits $\mathcal E$ is a multi-objective criterion. Some policies 
will be good for some instances and bad on others, and there are clear trade-offs. 
One way to analyse the performance in a multi-objective setting is called 
Pareto optimality. A policy is Pareto optimal if there does not exist another 
policy that is a strict improvement. More precisely, $\pi$ is Pareto optimal if there 
does not exist a $\pi'$ such that $R_n(\pi', \nu) \le R_n(\pi, \nu)$ for all $\nu \in \mathcal E$ and 
$R_n(\pi', \nu) < R_n(\pi, \nu)$ for at least one instance $\nu \in \mathcal E$. 

4 When we say a policy is minimax optimal up to constant factors for finite-armed 
1-subgaussian bandits with suboptimality gaps in [0,1], we mean there exists a 
$C > 0$ such that 


$$\frac{R_n(\pi, \mathcal E^k)}{R_n^*(\mathcal E^k)} \le C \quad \text{for all } k \text{ and } n,$$


where $\mathcal E^k$ is the set of $k$-armed 1-subgaussian bandits with suboptimality gaps 
in [0,1]. We often say a policy is minimax optimal up to logarithmic factors, 
by which we mean that 


$$\frac{R_n(\pi, \mathcal E^k)}{R_n^*(\mathcal E^k)} \le C(n, k) \quad \text{for all } k \text{ and } n,$$

where $C(n, k)$ is logarithmic in $n$ and $k$. We hope the reader will forgive us 
for not always specifying in the text exactly what is meant and promise that 
statements of theorems will always be precise. 
Exercises 


13.1 (MINIMAX RISK FOR HYPOTHESIS TESTING) Let $P_\mu = \mathcal N(\mu, 1)$ be the 
Gaussian measure on $(\mathbb R, \mathfrak B(\mathbb R))$ with mean $\mu \in \{0, \Delta\}$ and unit variance. Let 
$X : \mathbb R \to \mathbb R$ be the identity random variable ($X(\omega) = \omega$). For decision rule 
$d : \mathbb R \to \{0, \Delta\}$, define the risk 


$$R(d) = \max_{\mu \in \{0, \Delta\}} P_\mu(d(X) \ne \mu).$$


Prove that $R(d)$ is minimised by $d(x) = \operatorname{argmin}_{\tilde\mu \in \{0, \Delta\}} |x - \tilde\mu|$. 
13.2 (PARETO OPTIMAL POLICIES) Let $k > 1$ and $\mathcal E = \mathcal E_{\mathcal N}^k(1)$ be the set of 
Gaussian bandits with unit variance. Find a Pareto optimal policy for this class. 


Hint Think about simple policies (not necessarily good ones) and use the 
definition. 


14 


Foundations of Information Theory 


To make the arguments in the previous chapter rigorous and generalisable to 
other settings, we need some tools from information theory and statistics. The 
most important of these is the relative entropy, also known as the 
Kullback-Leibler divergence, named for Solomon Kullback and Richard Leibler 
(KL divergence, for short). 


14.1 
Entropy and Optimal Coding 


Alice wants to communicate with Bob. She wants to tell Bob the outcome of a 
sequence of $n$ independent random variables sampled from a known distribution $Q$. 
Alice and Bob agree to communicate using a binary code that is fixed in advance 
in such a way that the expected message length is minimised. The entropy of Q 
is the expected number of bits necessary per random variable using the optimal 
code as n tends to infinity. The relative entropy between distributions P and Q 
is the price in terms of expected message length that Alice and Bob have to pay 
if they believe the random variables are sampled from Q when in fact they are 
sampled from P. 

Let $P$ be a measure on $[N]$ with $\sigma$-algebra $2^{[N]}$ and $X : [N] \to [N]$ be the 
identity random variable, $X(\omega) = \omega$. Alice observes a realisation of $X$ and wants 
to communicate the result to Bob using a binary code that they agree upon 
in advance. For example, when $N = 4$, they might agree on the following code: 
1 → 00, 2 → 01, 3 → 10, 4 → 11. Then if Alice observes a 3, she sends Bob a 
message containing 10. For our purposes, a code is a function $c : [N] \to \{0,1\}^*$, 
where $\{0,1\}^*$ is the set of finite sequences of zeros and ones. 

Of course c must be injective so that no two numbers (or symbols) have the 
same code. We also require that c be prefix free, which means that no code is a 
prefix of any other. This is justified by supposing that Alice would like to tell 
Bob about multiple samples. Then Bob needs to know where the message for one 
symbol starts and ends. 


Using a prefix code is not the only way to enforce unique decodability, but 
all uniquely decodable codes have equivalent prefix codes (see Note 1). 
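
The prefix-free requirement is easy to test mechanically, and it is what allows Bob to split a concatenated message back into symbols. The following is a minimal sketch: `fixed` is the four-symbol code from the example above, while `skewed` and `bad` are hypothetical codes introduced only for illustration, as are the function names.

```python
def is_prefix_free(code):
    """Return True if no codeword is a prefix of the codeword of another symbol."""
    words = list(code.values())
    for a, u in enumerate(words):
        for b, v in enumerate(words):
            if a != b and v.startswith(u):
                return False
    return True


def decode(bits, code):
    """Decode a concatenation of codewords by reading the bits left to right."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # unambiguous because the code is prefix free
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return symbols


fixed = {1: "00", 2: "01", 3: "10", 4: "11"}    # the code from the text
skewed = {1: "0", 2: "10", 3: "110", 4: "111"}  # hypothetical variable-length code
bad = {1: "0", 2: "01", 3: "11"}                # "0" is a prefix of "01"

print(is_prefix_free(fixed), is_prefix_free(skewed), is_prefix_free(bad))
print(decode("100011", fixed))
```

The greedy left-to-right decoder only works because the code is prefix free. With `bad`, the first bit of the codeword for symbol 2 would already be read as symbol 1.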
The easiest choice is to use $\lceil \log_2(N) \rceil$ bits no matter the value of $X$. This 
simple code is sometimes effective, but is not entirely satisfactory if $X$ is far 
from uniform. To understand why, suppose that $N$ is extremely large and 
$P(X = 1) = 0.99$, and the remaining probability mass is uniform over 
$[N] \setminus \{1\}$. Then it seems preferable to have a short code for one and slightly 
longer codes for the alternatives. With this in mind, a natural objective is to find 
a code that minimises the expected code length. That is, 


$$c^* = \operatorname{argmin}_c \sum_{i=1}^N p_i \, \ell(c(i)), \tag{14.1}$$


where the argmin is taken over valid codes and $\ell(\cdot)$ is a function that returns 
the length of a code. The optimisation problem in (14.1) can be solved using 
Huffman coding, and the optimal value satisfies 


$$H_2(P) \le \sum_{i=1}^N p_i \, \ell(c^*(i)) \le H_2(P) + 1, \tag{14.2}$$


where $H_2(P)$ is the entropy of $P$, 


$$H_2(P) = \sum_{i \in [N] : p_i > 0} p_i \log_2\left(\frac{1}{p_i}\right).$$


Figure 14.1 A Huffman code for the English alphabet, including space. 


When p; = 1/N is uniform, the naive idea of using a code of uniform length is 
recovered, but for non-uniform distributions, the code adapts to assign shorter 
codes to symbols with larger probability. It is worth pointing out that the sum 
is only over outcomes that occur with non-zero probability, which is motivated 
by observing that $\lim_{x \to 0^+} x \log(1/x) = 0$ or by thinking of the entropy as an 
expectation of the log probability with respect to P, and expectations should not 
change when the value of the random variable is perturbed on a measure zero set. 

It turns out that H2(P) is not just an approximation on the expected length of 
the Huffman code, but is itself a fundamental quantity. Imagine that Alice wants 
to transmit a long string of symbols sampled from P. She could use a Huffman 
code to send Bob each symbol one at a time, but this introduces rounding errors 
that accumulate as the message length grows. There is another scheme called 
arithmetic coding for which the average number of bits per symbol approaches 
H2(P) and the source coding theorem says that this is unimprovable. 
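
The relation between the expected length of a Huffman code and the base-2 entropy can be checked numerically. The sketch below is a standard textbook construction, not code from the book: it repeatedly merges the two least likely subtrees with a heap and only tracks codeword lengths. The names `huffman_lengths` and `entropy_bits` and the example distribution are hypothetical.

```python
import heapq
import math


def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    if len(probs) == 1:
        return [1]
    # Heap items: (probability, unique tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


def entropy_bits(probs):
    """Base-2 entropy, summing only over outcomes with positive probability."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)
expected_length = sum(p * l for p, l in zip(probs, lengths))
print(lengths, expected_length, entropy_bits(probs))
```

For this dyadic distribution the expected length equals the entropy. For a skewed distribution such as (0.99, 0.005, 0.005) every symbol still costs at least one bit, so the expected length sits close to the upper end of the entropy-plus-one range, which is the rounding effect that coding long blocks avoids.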

The definition of entropy using base 2 makes sense from the perspective of 
sending binary messages. Mathematically, however, it is more convenient to define 
the entropy using the natural logarithm: 


$$H(P) = \sum_{i \in [N] : p_i > 0} p_i \log\left(\frac{1}{p_i}\right). \tag{14.3}$$

This is nothing more than a scaling of $H_2$. Measuring information using base 
2 logarithms has a unit of bits, and for the natural logarithm the unit is nats. 
By slightly abusing terminology, we will also call $H(P)$ the entropy of $P$. 


14.2 
Relative Entropy 


Suppose that Alice and Bob agree to use a code that is optimal when X is sampled 
from distribution Q. Unbeknownst to them, however, X is actually sampled from 
distribution P. The relative entropy between P and Q measures how much longer 
the messages are expected to be using the optimal code for Q than what would be 
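obtained using the optimal code for $P$. Letting $p_i = P(X = i)$ and $q_i = Q(X = i)$, 
assuming Shannon coding, working out the math while dropping $\lceil \cdot \rceil$ leads to the 
definition of relative entropy as 


$$D(P, Q) = \sum_{i \in [N] : p_i > 0} p_i \log\left(\frac{1}{q_i}\right) - \sum_{i \in [N] : p_i > 0} p_i \log\left(\frac{1}{p_i}\right) = \sum_{i \in [N] : p_i > 0} p_i \log\left(\frac{p_i}{q_i}\right). \tag{14.4}$$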


From the coding interpretation, one conjectures that $D(P, Q) \ge 0$. Indeed, this 
is easy to verify using Jensen’s inequality. Still poking around the definition, 
what happens when $q_i = 0$ and $p_i = 0$? This means that symbol $i$ is superfluous 
and the value of $D(P, Q)$ should not be impacted by introducing superfluous 
symbols. And again, it is not, by the definition of the expectations. We also see 
that the sufficient and necessary condition for $D(P, Q) < \infty$ is that for each $i$ 
with $q_i = 0$, we also have $p_i = 0$. The condition we discovered is equivalent to 
saying that P is absolutely continuous with respect to Q. Note that absolute 
continuity only implies a finite relative entropy when X takes on finitely many 
values (Exercise 14.2). 
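
The discrete definition and its conventions translate directly into code. The function below is a minimal sketch (the name `relative_entropy` is hypothetical) that works in nats, skips symbols with $p_i = 0$, and returns infinity when $P$ puts mass on a symbol to which $Q$ assigns zero probability.

```python
import math


def relative_entropy(p, q):
    """D(P, Q) in nats for probability vectors p and q on the same finite set."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # symbols with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # P is not absolutely continuous with respect to Q
        total += pi * math.log(pi / qi)
    return total


print(relative_entropy([0.3, 0.7], [0.6, 0.4]))
print(relative_entropy([0.6, 0.4], [0.3, 0.7]))  # differs: D is not symmetric
print(relative_entropy([0.5, 0.5], [1.0, 0.0]))  # inf
```

Appending a superfluous symbol with $p_i = q_i = 0$ to both vectors leaves the value unchanged, in line with the discussion above.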

This brings us back to defining relative entropy between probability measures 
$P$ and $Q$ on arbitrary measurable spaces $(\Omega, \mathcal F)$. When the support of $P$ is 
uncountable, defining the entropy via communication is hard because infinitely 
many symbols are needed to describe some outcomes. This seems to be a 
fundamental difficulty. Luckily, the impasse gets resolved automatically if we 
only consider relative entropy. While we cannot communicate the outcome, for 
any finite discretisation of the possible outcomes, the discretised values can be 
communicated finitely, and all our definitions will work. Formally, a discretisation 
to $[N]$ is specified by an $\mathcal F/2^{[N]}$-measurable map $X : \Omega \to [N]$. Then the entropy 
of $P$ relative to $Q$ can be defined as 


$$D(P, Q) = \sup_{N \in \mathbb N^+} \sup_X D(P_X, Q_X), \tag{14.5}$$

where $P_X$ is the push-forward of $P$ on $[N]$ defined by $P_X(A) = P(X \in A)$. The 
inner supremum is over all $\mathcal F/2^{[N]}$-measurable maps. Informally we take all possible 
discretisations $X$ (with no limit on the ‘fineness’ of the discretisation) and define 
$D(P, Q)$ as the excess information when expecting to see $X$ with $X \sim Q_X$, while 
in reality $X \sim P_X$. As we shall see soon, this is indeed a reasonable definition. 


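THEOREM 14.1. Let $(\Omega, \mathcal F)$ be a measurable space, and let $P$ and $Q$ be measures 
on this space. Then, 


$$D(P, Q) = \begin{cases} \int \log\left(\frac{dP}{dQ}(\omega)\right) dP(\omega), & \text{if } P \ll Q; \\ \infty, & \text{otherwise.} \end{cases}$$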

Note that the relative entropy between $P$ and $Q$ can still be infinite even when 
$P \ll Q$. Note also that in the case of discrete measures, the above expression 
reduces to (14.4). For calculating relative entropies, one often uses 
densities: if $\lambda$ is a common dominating $\sigma$-finite measure for $P$ and $Q$ (that is, 
$P \ll \lambda$ and $Q \ll \lambda$ both hold), then letting $p = \frac{dP}{d\lambda}$ and $q = \frac{dQ}{d\lambda}$, if also $P \ll Q$, 
the chain rule gives $\frac{dP}{dQ} \frac{dQ}{d\lambda} = \frac{dP}{d\lambda}$, which lets us write 


$$D(P, Q) = \int p \log\left(\frac{p}{q}\right) d\lambda. \tag{14.6}$$


This is probably the best-known expression for relative entropy and is often used 
as a definition. Note that for probability measures, a common dominating $\sigma$-finite 
measure can always be found. For example, $\lambda = P + Q$ always dominates both 
$P$ and $Q$. 

Relative entropy is a kind of ‘distance’ measure between distributions P and 
Q. In particular, D(P, Q) = 0 whenever P = Q, and otherwise D(P,Q) > 0. 
However, strictly speaking, the relative entropy is not a distance because it 
satisfies neither the triangle inequality nor is it symmetric. Nevertheless, it serves 
the same purpose. 

The relative entropy between many standard distributions is often quite easy 
to compute. For example, the relative entropy between two Gaussians with means 
$\mu_1, \mu_2 \in \mathbb R$ and common variance $\sigma^2$ is 


$$D(\mathcal N(\mu_1, \sigma^2), \mathcal N(\mu_2, \sigma^2)) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}.$$

The dependence on the difference in means and the variance is consistent with 
our intuition. If $\mu_1$ is close to $\mu_2$, then the ‘difference’ between the distributions 
should be small, but if the variance is very small, then there is little overlap, and 
the difference is large. The relative entropy between two Bernoulli distributions 
with means $p, q \in [0,1]$ is 


$$D(\mathcal B(p), \mathcal B(q)) = p \log\left(\frac{p}{q}\right) + (1 - p) \log\left(\frac{1-p}{1-q}\right),$$


where $0 \log(\cdot) = 0$. Due to its frequent appearance at various places, $D(\mathcal B(p), \mathcal B(q))$ 
gets the honour of being abbreviated to $d(p, q)$, which we have met before in 
Definition 10.1. 
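
Both closed forms are one-liners in code. The sketch below uses hypothetical function names and implements the convention $0 \log(\cdot) = 0$ explicitly. It returns infinity when the first Bernoulli distribution puts mass on an outcome that the second one excludes.

```python
import math


def kl_gaussian(mu1, mu2, sigma2):
    """Relative entropy between N(mu1, sigma2) and N(mu2, sigma2), in nats."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma2)


def kl_bernoulli(p, q):
    """d(p, q): relative entropy between Bernoulli(p) and Bernoulli(q), in nats."""
    def term(a, b):
        if a == 0.0:
            return 0.0  # convention 0 log(.) = 0
        if b == 0.0:
            return math.inf  # mass on an outcome that has probability zero under q
        return a * math.log(a / b)

    return term(p, q) + term(1.0 - p, 1.0 - q)


print(kl_gaussian(0.0, 1.0, 1.0))  # 0.5
print(kl_bernoulli(0.5, 0.25), kl_bernoulli(0.25, 0.5))
```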

We are nearing the end of our whirlwind tour of relative entropy. It remains 
to state the key lemma that connects the relative entropy to the hardness of 
hypothesis testing. 


THEOREM 14.2 (Bretagnolle-Huber inequality). Let $P$ and $Q$ be probability 
measures on the same measurable space $(\Omega, \mathcal F)$, and let $A \in \mathcal F$ be an arbitrary 
event. Then, 


$$P(A) + Q(A^c) \ge \frac{1}{2} \exp(-D(P, Q)), \tag{14.7}$$
where $A^c = \Omega \setminus A$ is the complement of $A$. 


The proof may be found at the end of the chapter, but first some interpretation 
and a simple application. Suppose that D(P, Q) is small; then P is close to Q 
in some sense. Since $P$ is a probability measure, we have $P(A) + P(A^c) = 1$. 
If $Q$ is close to $P$, then we might expect that $P(A) + Q(A^c)$ should be large. 
The purpose of the theorem is to quantify just how large. Note that if $P$ is not 
absolutely continuous with respect to $Q$, then $D(P, Q) = \infty$, and the result is 
vacuous. Also note that the result is symmetric. We could replace D(P, Q) with 
D(Q, P), which sometimes leads to a stronger result because the relative entropy 
is not symmetric. 

Returning to the hypothesis-testing problem described in the previous chapter, 
let $X$ be normally distributed with unknown mean $\mu \in \{0, \Delta\}$ and variance 
$\sigma^2 > 0$. We want to bound the quality of a rule for deciding what is the real mean 
from a single observation. The decision rule is characterised by a measurable 
set $A \subseteq \mathbb R$ on which the predictor guesses $\mu = \Delta$ (it predicts $\mu = 0$ on the 
complement of $A$). Let $P = \mathcal N(0, \sigma^2)$ and $Q = \mathcal N(\Delta, \sigma^2)$. Then the probability 
of an error under $P$ is $P(A)$, and the probability of error under $Q$ is $Q(A^c)$. The 
reader surely knows what to do next. By Theorem 14.2, we have 


$$P(A) + Q(A^c) \ge \frac{1}{2} \exp(-D(P, Q)) = \frac{1}{2} \exp\left(-\frac{\Delta^2}{2\sigma^2}\right).$$

If we assume that the signal-to-noise ratio is small, $\Delta^2/\sigma^2 \le 1$, then 

$$P(A) + Q(A^c) \ge \frac{1}{2} \exp\left(-\frac{1}{2}\right) \ge \frac{3}{10},$$

which implies $\max\{P(A), Q(A^c)\} \ge 3/20$. This means that no matter how we 
choose our decision rule, we simply do not have enough data to make a decision 
for which the probability of error on both $P$ and $Q$ is smaller than 3/20. 
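
The bound can be compared with the actual error of simple threshold rules. The sketch below assumes decision sets of the form $A = (c, \infty)$, which is only one family of rules, whereas the theorem covers every measurable $A$. The function names are hypothetical, and the Gaussian distribution function is computed with `math.erfc`.

```python
import math


def normal_cdf(x, mean, sigma):
    """Distribution function of N(mean, sigma^2) evaluated at x."""
    return 0.5 * math.erfc(-(x - mean) / (sigma * math.sqrt(2.0)))


def error_sum(threshold, delta, sigma):
    """P(A) + Q(A^c) for A = (threshold, inf), P = N(0, s^2), Q = N(delta, s^2)."""
    p_of_a = 1.0 - normal_cdf(threshold, 0.0, sigma)  # error when the mean is 0
    q_of_ac = normal_cdf(threshold, delta, sigma)     # error when the mean is delta
    return p_of_a + q_of_ac


def bretagnolle_huber_bound(delta, sigma):
    """Right-hand side of Theorem 14.2 for two Gaussians with common variance."""
    return 0.5 * math.exp(-delta ** 2 / (2.0 * sigma ** 2))


bound = bretagnolle_huber_bound(1.0, 1.0)
for c in (-2.0, 0.0, 0.5, 1.0, 3.0):
    print(c, error_sum(c, 1.0, 1.0), bound)
```

With $\Delta = \sigma = 1$ the midpoint threshold $c = 1/2$ gives an error sum of about 0.62, roughly twice the bound of about 0.30, so the inequality is not tight here but captures the right order of magnitude.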


Proof of Theorem 14.2 For reals $a, b$, we abbreviate $\max\{a, b\} = a \vee b$ and 
$\min\{a, b\} = a \wedge b$. The result is trivial if $D(P, Q) = \infty$. On the other hand, by 
Theorem 14.1, $D(P, Q) < \infty$ implies that $P \ll Q$. Let $\nu = P + Q$. Then $P, Q \ll \nu$, 
which by Theorem 2.13 ensures the existence of the Radon-Nikodym derivatives 
$p = \frac{dP}{d\nu}$ and $q = \frac{dQ}{d\nu}$. By Eq. (14.6), $D(P, Q) = \int p \log\left(\frac{p}{q}\right) d\nu$. For brevity, when 
writing integrals with respect to $\nu$, in this proof, we will drop $d\nu$. Thus, we will 
write, for example, $\int p \log(p/q)$ for the above integral. 

Instead of (14.7), we prove the stronger result that 


$$\int p \wedge q \ge \frac{1}{2} \exp(-D(P, Q)). \tag{14.8}$$


This indeed is sufficient since 
$\int p \wedge q = \int_A p \wedge q + \int_{A^c} p \wedge q \le \int_A p + \int_{A^c} q = P(A) + Q(A^c)$. 
We start with an inequality attributed to French mathematician 
Lucien Le Cam, which lower-bounds the left-hand side of Eq. (14.8). The inequality 
states that 

$$\int p \wedge q \ge \frac{1}{2} \left(\int \sqrt{pq}\right)^2. \tag{14.9}$$


How do we get this inequality? Starting from the right-hand side above, using 
$pq = (p \wedge q)(p \vee q)$ and Cauchy-Schwarz we get 


$$\left(\int \sqrt{pq}\right)^2 = \left(\int \sqrt{(p \wedge q)(p \vee q)}\right)^2 \le \left(\int p \wedge q\right) \left(\int p \vee q\right).$$


Now, using $p \wedge q + p \vee q = p + q$, the proof is finished by substituting 
$\int p \vee q = 2 - \int p \wedge q \le 2$ and dividing both sides by two. It remains to lower-bound 
the right-hand side of (14.9). For this, we use Jensen’s inequality. First, we write 
$(\cdot)^2$ as $\exp(2 \log(\cdot))$ and then move the log inside the integral: 


$$\left(\int \sqrt{pq}\right)^2 = \exp\left(2 \log \int \sqrt{pq}\right) = \exp\left(2 \log \int_{p > 0} p \sqrt{\frac{q}{p}}\right)$$
$$\ge \exp\left(2 \int_{p > 0} p \, \frac{1}{2} \log\left(\frac{q}{p}\right)\right) = \exp\left(-\int_{pq > 0} p \log\left(\frac{p}{q}\right)\right)$$
$$= \exp\left(-\int p \log\left(\frac{p}{q}\right)\right) = \exp(-D(P, Q)).$$


In the fourth and the last step, we used that since $P \ll Q$, $q = 0$ implies $p = 0$, 
and so $p > 0$ implies $q > 0$, and eventually $pq > 0$. The result is completed by 
chaining the inequalities. 


14.3 
Notes 
1 A code $c : \mathbb N^+ \to \{0,1\}^*$ is uniquely decodable if $i_1, \ldots, i_n \mapsto c(i_1) \cdots c(i_n)$ 
is injective, where on the right-hand side the codes are simply concatenated. 
Kraft’s inequality states that for any uniquely decodable code $c$, 


$$\sum_{i=1}^\infty 2^{-\ell(c(i))} \le 1. \tag{14.10}$$


Furthermore, for any $(\ell_i)_{i=1}^\infty$ satisfying $\sum_{i=1}^\infty 2^{-\ell_i} \le 1$, there exists a prefix 
code $c : \mathbb N^+ \to \{0,1\}^*$ such that $\ell(c(i)) = \ell_i$. The second part justifies our 
restriction to prefix codes rather than uniquely decodable codes in the definition 
of the entropy. 

2 The supremum in the definition given in Eq. (14.5) may often be taken over 
a smaller set. Precisely, let $(\Omega, \mathcal G)$ be a measurable space and suppose that 
$\mathcal G = \sigma(\mathcal F)$ where $\mathcal F$ is a field. Note that a field is defined by the same axioms 
as a $\sigma$-algebra except that being closed under countable unions is replaced by 
the condition that it be closed under finite unions. Then, for measures $P$ and 
$Q$ on $(\Omega, \mathcal G)$, it holds that 


$$D(P, Q) = \sup_X D(P_X, Q_X),$$


where the supremum is over $\mathcal F/2^{[N]}$-measurable functions. This result is known 
as Dobrushin’s theorem. 

3 In the proof of Theorem 14.2 we used the inequality $P(A) + Q(A^c) \ge \int p \wedge q$.
Looking at the proof, it is not hard to see that the inequality becomes an equality
when $A = \{p \le q\} = \{q/p \ge 1\}$. Readers familiar with statistical decision
theory may recognize that this is a special case of the Neyman-Pearson
lemma, which states that the most powerful test among all statistical tests in a
simple hypothesis testing problem at a given significance level is the likelihood
ratio test. Exercise 14.14 explores this connection.
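On a small finite space the claim about equality can be confirmed by brute force. In the sketch below (the function names and the two distributions are illustrative assumptions), we enumerate every event $A$ and compare the smallest value of $P(A) + Q(A^c)$ with $\int p \wedge q$.

```python
from itertools import combinations


def error_sum(p, q, event):
    """P(A) + Q(A^c) for distributions p, q on {0, ..., m-1} and an index set A."""
    return (sum(p[i] for i in event)
            + sum(q[i] for i in range(len(p)) if i not in event))


def best_event(p, q):
    """Brute-force search for the event A minimising P(A) + Q(A^c)."""
    m = len(p)
    subsets = (set(c) for r in range(m + 1) for c in combinations(range(m), r))
    return min(subsets, key=lambda event: error_sum(p, q, event))


# Hypothetical distributions on a four-point space.
p = [0.5, 0.3, 0.1, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
lr_event = {i for i in range(len(p)) if p[i] <= q[i]}   # A = {p <= q}
overlap = sum(min(a, b) for a, b in zip(p, q))          # integral of p ^ q
print(error_sum(p, q, lr_event), error_sum(p, q, best_event(p, q)), overlap)
```

All three printed values should agree up to floating-point error.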

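4 How tight is Theorem 14.2? We remarked already that $D(P, Q) = 0$ if and only
if $P = Q$. But in this case, Theorem 14.2 only gives

\[ 1 = P(A) + Q(A^c) \ge \frac{1}{2} \exp(-D(P, Q)) = \frac{1}{2}, \]

which does not seem so strong. From where does the weakness arise? The
answer is in Le Cam's inequality, Eq. (14.9), which can be refined by

\[ \left( \int \sqrt{pq} \right)^2 \le \left( \int p \wedge q \right) \left( \int p \vee q \right) = \left( \int p \wedge q \right) \left( 2 - \int p \wedge q \right). \]

By solving the quadratic inequality, we have

\begin{align*}
P(A) + Q(A^c) \ge \int p \wedge q &\ge 1 - \sqrt{1 - \left( \int \sqrt{pq} \right)^2} \\
&\ge 1 - \sqrt{1 - \exp(-D(P, Q))}, \tag{14.11}
\end{align*}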


which gives a modest improvement on Theorem 14.2 that becomes more 
pronounced when D(P, Q) is close to zero, as demonstrated by Fig. 14.2. This 
stronger bound might be useful for fractionally improving constant factors in 
lower bounds, but we do not know of any application for which it is really 
crucial, and the more complicated form makes it cumbersome to use. Part of 
the reason for this is that the situation where D(P, Q) is small is better dealt 
with using a different inequality, as explained in the next note.

Figure 14.2 Tightening the inequality of Le Cam. Here, $x = \exp(-D(P, Q))$. Higher
values are better as the figure shows lower bounds on $P(A) + Q(A^c)$. The black
curve corresponds to (14.7), the blue curve corresponds to (14.11) and the red curve
corresponds to (14.12) (Pinsker's inequality). As can be seen, (14.11) indeed dominates
(14.7). For small $x$ ($D(P, Q)$ large), Pinsker is vacuous while the others are
not; for larger $x$ ($D(P, Q)$ near zero), Pinsker dominates (14.11), though in the limit
when $D(P, Q) = 0$, both Pinsker and (14.11) give the “correct” value of 1.


5 Another inequality from information theory is Pinsker's inequality, which
states for measures $P$ and $Q$ on the same probability space $(\Omega, \mathcal{F})$ that

\[ \delta(P, Q) = \sup_{A \in \mathcal{F}} P(A) - Q(A) \le \sqrt{\frac{1}{2} D(P, Q)}. \tag{14.12} \]

The quantity on the left-hand side is called the total variation distance
between $P$ and $Q$, which is a distance on the space of probability measures on
a probability space. From this we can derive for any measurable $A \in \mathcal{F}$ that

\[ P(A) + Q(A^c) \ge 1 - \delta(P, Q) \ge 1 - \sqrt{\frac{1}{2} D(P, Q)} = 1 - \sqrt{\frac{1}{2} \log\left( \frac{1}{\exp(-D(P, Q))} \right)}. \tag{14.13} \]


Examining Fig. 14.2 shows that this is an improvement on Eq. (14.11) when 
D(P, Q) is small. However, we also see that in the opposite case, when D(P, Q) 
is large, Eq. (14.13) is worse than Eq. (14.11), or the inequality in Theorem 14.2. 
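The comparison between the three lower bounds on $P(A) + Q(A^c)$ depends only on the value of $D(P, Q)$, so it is easy to tabulate. The sketch below uses function names of our own choosing and takes the relative entropy `d` as its only input.

```python
import math


def bound_bh(d):
    """Right-hand side of the Bretagnolle-Huber inequality (14.7)."""
    return 0.5 * math.exp(-d)


def bound_refined(d):
    """Right-hand side of the refined bound (14.11)."""
    return 1.0 - math.sqrt(1.0 - math.exp(-d))


def bound_pinsker(d):
    """Right-hand side of the Pinsker-based bound (14.13)."""
    return 1.0 - math.sqrt(d / 2.0)


for d in (0.01, 0.1, 1.0, 3.0):
    print(d, bound_bh(d), bound_refined(d), bound_pinsker(d))
```

For small `d` the Pinsker-based bound is the largest, while for `d` greater than 2 it is negative and hence vacuous.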


6 We saw the total variation distance in Eq. (14.12). There are two other
‘distances’ that are occasionally useful. These are the Hellinger distance
and the $\chi$-squared distance, which, using the notation in the proof of
Theorem 14.2, are defined by

\begin{align*}
h(P, Q) &= \sqrt{\int \left( \sqrt{p} - \sqrt{q} \right)^2} = \sqrt{2 \left( 1 - \int \sqrt{pq} \right)}, \tag{14.14} \\
\chi^2(P, Q) &= \int \frac{(p - q)^2}{q} = \int \frac{p^2}{q} - 1. \tag{14.15}
\end{align*}

The Hellinger distance is bounded and exists for all probability measures $P$
and $Q$. A necessary condition for the $\chi^2$-distance to exist is that $P \ll Q$. Like
the total variation distance, the Hellinger distance is actually a distance (it is
symmetric and satisfies the triangle inequality), but the $\chi^2$-‘distance’ is not. It is
possible to show (Tsybakov [2008], chapter 2) that

\[ \delta(P, Q)^2 \le h(P, Q)^2 \le D(P, Q) \le \chi^2(P, Q). \tag{14.16} \]

All the inequalities are tight for some choices of $P$ and $Q$, but the examples
do not chain together, as evidenced by Pinsker's inequality, which shows that
$\delta(P, Q)^2 \le D(P, Q)/2$ (which is also tight for some $P$ and $Q$).
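For distributions on a finite set, all four quantities in Eq. (14.16) are finite sums and the ordering can be checked directly. The following is a sketch with hypothetical function names; it assumes $q > 0$ everywhere so that the $\chi^2$-distance and relative entropy are finite.

```python
import math


def total_variation(p, q):
    """delta(P, Q) = sup_A P(A) - Q(A), which equals half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger(p, q):
    """h(P, Q) as in Eq. (14.14), without any normalising factor."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def relative_entropy(p, q):
    """D(P, Q), with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def chi_squared(p, q):
    """chi^2(P, Q) as in Eq. (14.15); requires q > 0 everywhere."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))


# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
tv, h, kl, chi2 = (total_variation(p, q), hellinger(p, q),
                   relative_entropy(p, q), chi_squared(p, q))
print(tv ** 2, h ** 2, kl, chi2)
```

The printed values should be non-decreasing from left to right, and `tv ** 2` should also be at most `kl / 2`, as Pinsker's inequality requires.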

7 The entropy for distribution $P$ was defined as $H(P)$ in Eq. (14.3). If $X$ is a
random variable, then $H(X)$ is defined to be the entropy of the law of $X$. This
is a convenient notation because it allows one to write $H(f(X))$ and $H(XY)$
and similar expressions.

Exercises


14.1 Let $P$ be a probability distribution on $\mathbb{N}^+$ and $p_i = P(\{i\})$. Show that for
any prefix code $c : \mathbb{N}^+ \to \{0,1\}^*$, it holds that

\[ \sum_{i=1}^\infty p_i \ell(c(i)) \ge H(P). \]

HINT Use Kraft's inequality from Note 1.

14.2 Find probability measures $P$ and $Q$ on $\mathbb{N}^+$ with $P \ll Q$ and $D(P, Q) = \infty$.


14.3 Prove the inequality in Eq. (14.10) for prefix-free codes $c$.


HINT Consider an infinite sequence of independent Bernoulli random variables
$(X_n)_{n=1}^\infty$ where $X_n \sim \mathcal{B}(1/2)$. Viewing $X$ as an infinite binary string, what is the
probability that $X$ has a prefix that is a code for some symbol?


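14.4 Let $(\Omega, \mathcal{F})$ be a measurable space, and let $P, Q : \mathcal{F} \to [0,1]$ be probability
measures. Let $a < b$ and $X : \Omega \to [a, b]$ be an $\mathcal{F}$-measurable random variable.
Prove that

\[ \left| \int_\Omega X(\omega) \, dP(\omega) - \int_\Omega X(\omega) \, dQ(\omega) \right| \le (b - a) \delta(P, Q). \]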


14.5 (ENTROPY INEQUALITIES) Prove that each of the inequalities in Eq. (14.16) 
is tight. 


14.6 (COUNTING MEASURE ABSOLUTE CONTINUITY AND DERIVATIVES) Let $\Omega$ be
a countable set and $p : \Omega \to [0, 1]$ be a distribution on $\Omega$ so that $\sum_{\omega \in \Omega} p(\omega) = 1$.
Let $P$ be the measure associated with $p$, which means that $P(A) = \sum_{\omega \in A} p(\omega)$.
Recall that the counting measure $\mu$ is the measure on $(\Omega, 2^\Omega)$ given by $\mu(A) = |A|$
if $A$ is finite and $\mu(A) = \infty$ otherwise.


(a) Show that $P$ is absolutely continuous with respect to $\mu$.
(b) Show that the Radon-Nikodym derivative $dP/d\mu$ exists and that $dP/d\mu(\omega) = p(\omega)$.


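14.7 (RELATIVE ENTROPY FOR GAUSSIAN DISTRIBUTIONS) For each $i \in \{1, 2\}$,
let $\mu_i \in \mathbb{R}$, $\sigma_i^2 > 0$ and $P_i = \mathcal{N}(\mu_i, \sigma_i^2)$. Show that

\[ D(P_1, P_2) = \frac{1}{2} \left( \log\left( \frac{\sigma_2^2}{\sigma_1^2} \right) + \frac{\sigma_1^2}{\sigma_2^2} - 1 \right) + \frac{(\mu_1 - \mu_2)^2}{2 \sigma_2^2}. \]

14.8 Let $\lambda$ be the Lebesgue measure on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$. Find


(a) a probability measure on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ that is not absolutely continuous with
respect to $\lambda$; and

(b) a probability measure $P$ on $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ that is absolutely continuous with respect to $\lambda$
with $D(P, Q) = \infty$ where $Q = \mathcal{N}(0, 1)$ is the standard Gaussian measure.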


14.9 (RELATIVE ENTROPY BETWEEN PUSH-FORWARD MEASURES) Let $P$ and
$Q$ be measures on $(\Omega, \mathcal{F})$ and let $Z$ be a random element over this space taking
values in $(\mathcal{Z}, \mathcal{G})$. Let $P_Z$ ($Q_Z$) be the push-forward of $P$ (respectively, $Q$) under
$Z$: $P_Z(U) = P(Z \in U)$ (resp., $Q_Z(U) = Q(Z \in U)$). Show that if $P \ll Q$ then

\[ D(P_Z, Q_Z) = \int \log\left( \frac{dP_Z}{dQ_Z}(Z(\omega)) \right) dP(\omega). \]


HıNT Show that if P< Q then Pz « Qz and Q almost surely, T (Z(w)) = 


a2 (Z(w)).

14.10 (DATA PROCESSING INEQUALITY) Let $P$ and $Q$ be measures on $(\Omega, \mathcal{F})$,
and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$ and $P_{\mathcal{G}}$ and $Q_{\mathcal{G}}$ be the restrictions of $P$ and $Q$
to $(\Omega, \mathcal{G})$. Show that $D(P_{\mathcal{G}}, Q_{\mathcal{G}}) \le D(P, Q)$.


HINT Use the definition of relative entropy from Eq. (14.5). 
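
14.11 Let $(\Omega, \mathcal{F})$ be a measurable space and $P, Q : \mathfrak{B}(\mathbb{R}) \times \Omega \to [0,1]$ be a pair
of probability kernels from $(\Omega, \mathcal{F})$ to $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$. Prove that

\[ V = \{ \omega \in \Omega : D(P(\cdot \mid \omega), Q(\cdot \mid \omega)) = \infty \} \in \mathcal{F}. \]

HINT Apply Dobrushin's theorem to the field of finite unions of rational-valued
intervals in $\mathbb{R}$.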


14.12 (CHAIN RULE) Let $P$ and $Q$ be measures on $(\mathbb{R}^n, \mathfrak{B}(\mathbb{R}^n))$, and for $t \in [n]$,
let $X_t(x) = x_t$ be the coordinate projection from $\mathbb{R}^n \to \mathbb{R}$. Then let $P_t$ and $Q_t$ be
regular versions of $X_t$ given $X_1, \ldots, X_{t-1}$ under $P$ and $Q$, respectively. Show
that

\[ D(P, Q) = \sum_{t=1}^n \mathbb{E}_P \left[ D(P_t(\cdot \mid X_1, \ldots, X_{t-1}), Q_t(\cdot \mid X_1, \ldots, X_{t-1})) \right]. \tag{14.17} \]

HINT This is a rather technical exercise. You will likely need to apply a
monotone class argument [Kallenberg, 2002, theorem 1.1]. For the definition
of a regular version, see [Kallenberg, 2002, theorem 5.3] or Theorem 3.11.
Briefly, $P_t$ is a probability kernel from $(\mathbb{R}^{t-1}, \mathfrak{B}(\mathbb{R}^{t-1}))$ to $(\mathbb{R}, \mathfrak{B}(\mathbb{R}))$ such that
$P_t(A \mid x_1, \ldots, x_{t-1}) = P(X_t \in A \mid X_1, \ldots, X_{t-1})$ with $P$-probability one for all
$A \in \mathfrak{B}(\mathbb{R})$.


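14.13 (CHAIN RULE (CONT.)) Let $P$ and $Q$ be measures on $(\mathbb{R}^n, \mathfrak{B}(\mathbb{R}^n))$, and
for $t \in [n]$, let $X_t(x) = x_t$ be the coordinate projection from $\mathbb{R}^n \to \mathbb{R}$. Then let $P_t$
and $Q_t$ be regular versions of $X_t$ given $X_1, \ldots, X_{t-1}$ under $P$ and $Q$, respectively.
Let $\tau$ be a stopping time adapted to the filtration generated by $X_1, \ldots, X_n$ with
$\tau \in [n]$ almost surely. Show that

\[ D(P_{|\mathcal{F}_\tau}, Q_{|\mathcal{F}_\tau}) = \mathbb{E}_P \left[ \sum_{t=1}^\tau D(P_t(\cdot \mid X_1, \ldots, X_{t-1}), Q_t(\cdot \mid X_1, \ldots, X_{t-1})) \right]. \]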


14.14 (NEYMAN-PEARSON LEMMA) The simple hypothesis testing problem is 
specified by a measurable space $(\mathcal{X}, \mathcal{F})$ and two distinct probability distributions
over this space, $P$ and $Q$. The problem is to decide based on observing a random
element $X$ taking values in $\mathcal{X}$ whether its distribution follows $P$ or $Q$. The decision
rule, or test, is a measurable map $f$ from $\mathcal{X}$ to (say) $\{P, Q\}$: when $f(x) = P$, the
rule decides in favor of $P$ given the observation $x$, otherwise it decides in favor of
$Q$. There are two ways a test can be wrong and a test is characterized by the
probability of an incorrect decision in the two cases. In particular, the probability
of the test making an error when $X$ follows $P$ is $e(f, P) := \int \mathbb{I}\{f(x) = Q\} \, P(dx)$
and the probability of error when $X$ follows $Q$ is $e(f, Q) := \int \mathbb{I}\{f(x) = P\} \, Q(dx)$.
Show that the following hold:


(a) If a test $f$ is such that $f(x) = P$ if $p(x) > \eta q(x)$ and $f(x) = Q$ if $p(x) < \eta q(x)$
and $\alpha = e(f, P)$ then $e(f, Q) = \min\{e(f', Q) : e(f', P) \le \alpha\}$. Here,
$p = dP/d\nu$ and $q = dQ/d\nu$ where $\nu$ is a common dominating measure
of $P$ and $Q$.

(b) If $f$ as postulated in the previous part exists, and $f'$ is a test such that
$e(f', Q) = e(f, Q)$ and $e(f', P) \le \alpha$ then $e(f', P) = \alpha$. Furthermore, $f' = f$
holds except perhaps over the union of the set $\{x : p(x) = \eta q(x)\}$ and a set
that has zero measure under both $P$ and $Q$.


In statistics, the upper bound $\alpha$ on $e(f, P)$ is called the significance level of
the test, while $1 - e(f, Q)$ is called its power, and breaking the symmetry,
$e(f, P)$ is called the type-I error, while $e(f, Q)$ is called the type-II error.
Since the decision rule in the first part of the lemma can be expressed as a
function of the ratio $p/q$ of densities (or likelihoods), the test in this part
is called a likelihood ratio test. The first part of the exercise says that the
most powerful tests for a given significance level are the likelihood ratio
tests and the second part says that these are the only most powerful tests
(uniqueness).


Minimax Lower Bounds


After the short excursion into information theory, let us return to the world 
of k-armed stochastic bandits. In what follows, we fix the horizon n > 0 and 
the number of actions k > 1. This chapter has two components. The first is 
an exact calculation of the relative entropy between measures in the canonical 
bandit model for a fixed policy and different bandits. In the second component, 
we prove a minimax lower bound that formalises the intuitive arguments given in 
Chapter 13. 


Relative Entropy Between Bandits 


The following result will be used repeatedly. Some generalisations are provided 
in the exercises. 


LEMMA 15.1 (Divergence decomposition). Let $\nu = (P_1, \ldots, P_k)$ be the reward
distributions associated with one $k$-armed bandit, and let $\nu' = (P_1', \ldots, P_k')$ be the
reward distributions associated with another $k$-armed bandit. Fix some policy $\pi$
and let $\mathbb{P}_\nu = \mathbb{P}_{\nu\pi}$ and $\mathbb{P}_{\nu'} = \mathbb{P}_{\nu'\pi}$ be the probability measures on the canonical
bandit model (Section 4.6) induced by the $n$-round interconnection of $\pi$ and $\nu$
(respectively, $\pi$ and $\nu'$). Then,

\[ D(\mathbb{P}_\nu, \mathbb{P}_{\nu'}) = \sum_{i=1}^k \mathbb{E}_\nu[T_i(n)] \, D(P_i, P_i'). \tag{15.1} \]

i=1 
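Proof Assume that $D(P_i, P_i') < \infty$ for all $i \in [k]$. It follows that $P_i \ll P_i'$. Define
$\lambda = \sum_{i=1}^k P_i + P_i'$, which is the measure defined by $\lambda(A) = \sum_{i=1}^k (P_i(A) + P_i'(A))$
for any measurable set $A$. Theorem 14.1 shows that, as long as $d\mathbb{P}_\nu / d\mathbb{P}_{\nu'} < +\infty$,

\[ D(\mathbb{P}_\nu, \mathbb{P}_{\nu'}) = \mathbb{E}_\nu \left[ \log\left( \frac{d\mathbb{P}_\nu}{d\mathbb{P}_{\nu'}} \right) \right]. \]

Recalling that $\rho$ is the counting measure over $[k]$, we find that the Radon-Nikodym
derivative of $\mathbb{P}_\nu$ with respect to the product measure $(\rho \times \lambda)^n$ is given in Eq. (4.7)
as

\[ p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^n \pi_t(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1}) \, p_{a_t}(x_t). \]

The density of $\mathbb{P}_{\nu'}$ is identical except that $p_{a_t}$ is replaced by $p_{a_t}'$. Then

\[ \log \frac{d\mathbb{P}_\nu}{d\mathbb{P}_{\nu'}}(a_1, x_1, \ldots, a_n, x_n) = \sum_{t=1}^n \log \frac{p_{a_t}(x_t)}{p_{a_t}'(x_t)}, \tag{15.2} \]

where we used the chain rule for Radon-Nikodym derivatives and the fact that
the terms involving the policy cancel. Taking expectations of both sides,

\[ \mathbb{E}_\nu \left[ \log \frac{d\mathbb{P}_\nu}{d\mathbb{P}_{\nu'}}(A_1, X_1, \ldots, A_n, X_n) \right] = \sum_{t=1}^n \mathbb{E}_\nu \left[ \log \frac{p_{A_t}(X_t)}{p_{A_t}'(X_t)} \right], \]

and

\[ \mathbb{E}_\nu \left[ \log \frac{p_{A_t}(X_t)}{p_{A_t}'(X_t)} \right] = \mathbb{E}_\nu \left[ \mathbb{E}_\nu \left[ \log \frac{p_{A_t}(X_t)}{p_{A_t}'(X_t)} \,\Big|\, A_t \right] \right] = \mathbb{E}_\nu \left[ D(P_{A_t}, P_{A_t}') \right], \]

where in the second equality we used that under $\mathbb{P}_\nu(\cdot \mid A_t)$, the distribution of $X_t$
is $dP_{A_t} = p_{A_t} d\lambda$. Plugging back into the previous display,

\begin{align*}
\mathbb{E}_\nu \left[ \log \frac{d\mathbb{P}_\nu}{d\mathbb{P}_{\nu'}}(A_1, X_1, \ldots, A_n, X_n) \right]
&= \sum_{t=1}^n \mathbb{E}_\nu \left[ \log \frac{p_{A_t}(X_t)}{p_{A_t}'(X_t)} \right] \\
&= \sum_{t=1}^n \mathbb{E}_\nu \left[ D(P_{A_t}, P_{A_t}') \right] = \sum_{i=1}^k \mathbb{E}_\nu \left[ \sum_{t=1}^n \mathbb{I}\{A_t = i\} \, D(P_{A_t}, P_{A_t}') \right] \\
&= \sum_{i=1}^k \mathbb{E}_\nu[T_i(n)] \, D(P_i, P_i').
\end{align*}
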
When the right-hand side of (15.1) is infinite, by our previous calculation, it is 
not hard to see that the left-hand side will also be infinite. 
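The identity in Eq. (15.1) can be illustrated by simulation for unit-variance Gaussian arms, where $D(P_i, P_i') = (\mu_i - \mu_i')^2/2$. The sketch below makes several assumptions of our own: the policy is a simple epsilon-greedy rule chosen only for illustration, the function name `simulate` is hypothetical, and both sides are estimated by Monte Carlo, so they agree only up to sampling error.

```python
import random


def simulate(mu, mu_alt, n, episodes, seed=0):
    """Estimate both sides of Eq. (15.1) for unit-variance Gaussian bandits.

    mu and mu_alt are the mean vectors of the two bandits; data is generated
    from the bandit with means mu. Returns (lhs, rhs) where lhs estimates the
    expected log-likelihood ratio and rhs estimates
    sum_i E[T_i(n)] * (mu[i] - mu_alt[i])**2 / 2.
    """
    rng = random.Random(seed)
    k = len(mu)
    total_llr = 0.0
    total_pulls = [0] * k
    for _ in range(episodes):
        sums, counts = [0.0] * k, [0] * k
        for t in range(n):
            if t < k:
                a = t                                   # play each arm once
            elif rng.random() < 0.1:
                a = rng.randrange(k)                    # explore
            else:
                a = max(range(k), key=lambda i: sums[i] / counts[i])  # exploit
            x = rng.gauss(mu[a], 1.0)
            # log p_a(x) - log p'_a(x) for unit-variance Gaussian densities
            total_llr += ((x - mu_alt[a]) ** 2 - (x - mu[a]) ** 2) / 2.0
            sums[a] += x
            counts[a] += 1
            total_pulls[a] += 1
    lhs = total_llr / episodes
    rhs = sum(total_pulls[i] / episodes * (mu[i] - mu_alt[i]) ** 2 / 2.0
              for i in range(k))
    return lhs, rhs


lhs, rhs = simulate([0.5, 0.0, 0.0], [0.5, 0.0, 1.0], n=20, episodes=2000)
print(lhs, rhs)
```

Only the third arm differs between the two bandits here, so the right-hand side reduces to half the average number of pulls of that arm.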


We note in passing that the divergence decomposition holds regardless of 
whether the action set is discrete or not. In its more general form, the sum 
over the actions must be replaced by an integral with respect to an appropriate 
non-negative measure, which generalises the expected number of pulls of arms. 
For details, see Exercise 15.8. 


Minimax Lower Bounds 


Recall that $\mathcal{E}_{\mathcal{N}}^k(1)$ is the class of Gaussian bandits with unit variance, which can
be parameterised by their mean vector $\mu \in \mathbb{R}^k$. Given $\mu \in \mathbb{R}^k$, let $\nu_\mu$ be the
Gaussian bandit for which the $i$th arm has reward distribution $\mathcal{N}(\mu_i, 1)$.


THEOREM 15.2. Let $k > 1$ and $n \ge k - 1$. Then, for any policy $\pi$, there exists a
mean vector $\mu \in [0,1]^k$ such that

\[ R_n(\pi, \nu_\mu) \ge \frac{1}{27} \sqrt{(k - 1) n}. \]

Since $\nu_\mu \in \mathcal{E}_{\mathcal{N}}^k(1)$, it follows that the minimax regret for $\mathcal{E}_{\mathcal{N}}^k(1)$ is lower-bounded
by the right-hand side of the above display as soon as $n \ge k - 1$:

\[ R_n^*(\mathcal{E}_{\mathcal{N}}^k(1)) \ge \frac{1}{27} \sqrt{(k - 1) n}. \]

The idea of the proof is illustrated in Fig. 15.1.

Figure 15.1 The idea of the minimax lower bound (horizontal axis: action index). Given a policy and one environment,
the evil antagonist picks another environment so that the policy will suffer a large regret
in at least one environment.


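Proof Fix a policy $\pi$. Let $\Delta \in [0, 1/2]$ be some constant to be chosen later. As
suggested in Chapter 13, we start with a Gaussian bandit with unit variance
and mean vector $\mu = (\Delta, 0, 0, \ldots, 0)$. This environment and $\pi$ give rise to the
distribution $\mathbb{P}_{\nu_\mu, \pi}$ on the canonical bandit model $(\mathcal{H}_n, \mathcal{F}_n)$. For brevity we will
use $\mathbb{P}_\mu$ in place of $\mathbb{P}_{\nu_\mu, \pi}$, and expectations under $\mathbb{P}_\mu$ will be denoted by $\mathbb{E}_\mu$. To
choose the second environment, let

\[ i = \operatorname{argmin}_{j > 1} \mathbb{E}_\mu[T_j(n)]. \]

Since $\sum_{j=1}^k \mathbb{E}_\mu[T_j(n)] = n$, it holds that $\mathbb{E}_\mu[T_i(n)] \le n/(k - 1)$. The second bandit
is also Gaussian with unit variance and means

\[ \mu' = (\Delta, 0, 0, \ldots, 0, 2\Delta, 0, \ldots, 0), \]

where specifically $\mu_i' = 2\Delta$. Therefore, $\mu_j = \mu_j'$ except at index $i$ and the optimal
arm in $\nu_\mu$ is the first arm, while in $\nu_{\mu'}$ arm $i$ is optimal. We abbreviate $\mathbb{P}_{\mu'} = \mathbb{P}_{\nu_{\mu'}, \pi}$.
Lemma 4.5 and a simple calculation lead to

\[ R_n(\pi, \nu_\mu) \ge \mathbb{P}_\mu(T_1(n) \le n/2) \frac{n\Delta}{2} \quad \text{and} \quad R_n(\pi, \nu_{\mu'}) > \mathbb{P}_{\mu'}(T_1(n) > n/2) \frac{n\Delta}{2}. \]

Then, applying the Bretagnolle-Huber inequality from the previous chapter
(Theorem 14.2),

\begin{align*}
R_n(\pi, \nu_\mu) + R_n(\pi, \nu_{\mu'}) &> \frac{n\Delta}{2} \left( \mathbb{P}_\mu(T_1(n) \le n/2) + \mathbb{P}_{\mu'}(T_1(n) > n/2) \right) \\
&\ge \frac{n\Delta}{4} \exp(-D(\mathbb{P}_\mu, \mathbb{P}_{\mu'})). \tag{15.3}
\end{align*}

It remains to upper-bound $D(\mathbb{P}_\mu, \mathbb{P}_{\mu'})$. For this, we use Lemma 15.1 and the
definitions of $\mu$ and $\mu'$ to get

\[ D(\mathbb{P}_\mu, \mathbb{P}_{\mu'}) = \mathbb{E}_\mu[T_i(n)] \, D(\mathcal{N}(0, 1), \mathcal{N}(2\Delta, 1)) = \mathbb{E}_\mu[T_i(n)] \frac{(2\Delta)^2}{2} \le \frac{2 n \Delta^2}{k - 1}. \]

Plugging this into the previous display, we find that

\[ R_n(\pi, \nu_\mu) + R_n(\pi, \nu_{\mu'}) \ge \frac{n\Delta}{4} \exp\left( -\frac{2 n \Delta^2}{k - 1} \right). \]

The result is completed by choosing $\Delta = \sqrt{(k - 1)/(4n)} \le 1/2$, where the inequality
follows from the assumptions in the theorem statement. The final steps are lower
bounding $\exp(-1/2)$ and using $2\max(a, b) \ge a + b$.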
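The arithmetic behind the constant in Theorem 15.2 is easy to lose track of, so here is a sketch that evaluates the final steps of the proof for concrete values of $n$ and $k$. The function name is ours; the computation just substitutes $\Delta = \sqrt{(k-1)/(4n)}$ into the bound on the sum of the two regrets and halves it.

```python
import math


def minimax_lower_bound(n, k):
    """Follow the last steps of the proof for horizon n and k arms.

    Assumes k > 1 and n >= k - 1. Returns (delta, bound) where bound is the
    resulting lower bound on the larger of the two regrets.
    """
    delta = math.sqrt((k - 1) / (4 * n))
    kl_upper = 2 * n * delta ** 2 / (k - 1)              # equals 1/2 by choice of delta
    sum_bound = n * delta / 4 * math.exp(-kl_upper)      # bound on the sum of regrets
    return delta, sum_bound / 2                          # 2 max(a, b) >= a + b


for n, k in [(10, 2), (100, 10), (1000, 50)]:
    delta, bound = minimax_lower_bound(n, k)
    print(n, k, delta, bound, math.sqrt((k - 1) * n) / 27)
```

The bound returned is $\exp(-1/2)\sqrt{(k-1)n}/16$, and $\exp(-1/2)/16 \approx 0.0379$ is slightly larger than $1/27 \approx 0.0370$, which is where the constant in the theorem comes from.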


We encourage readers to go through the alternative proof outlined in 
Exercise 15.2, which takes a slightly different path. 


Notes 


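1 We used the Gaussian noise model because the KL divergences are so easily
calculated in this case, but all that we actually used was that $D(P_i, P_i') =
O((\mu_i - \mu_i')^2)$ when the gap between the means $\Delta = \mu_i - \mu_i'$ is small. While
this is certainly not true for all distributions, it very often is. Why is that? Let
$\{P_\mu : \mu \in \mathbb{R}\}$ be some parametric family of distributions on $\Omega$ and assume that
distribution $P_\mu$ has mean $\mu$. Assuming the densities are twice differentiable
and that everything is sufficiently nice that integrals and derivatives can be
exchanged (as is almost always the case), we can use a Taylor expansion about
$\mu$ to show that

\begin{align*}
D(P_\mu, P_{\mu+\Delta})
&\approx \frac{\partial}{\partial \Delta} D(P_\mu, P_{\mu+\Delta}) \Big|_{\Delta=0} \Delta + \frac{1}{2} \frac{\partial^2}{\partial \Delta^2} D(P_\mu, P_{\mu+\Delta}) \Big|_{\Delta=0} \Delta^2 \\
&= \frac{\partial}{\partial \Delta} \int_\Omega \log\left( \frac{dP_\mu}{dP_{\mu+\Delta}} \right) dP_\mu \Big|_{\Delta=0} \Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= -\int_\Omega \frac{\partial}{\partial \Delta} \log\left( \frac{dP_{\mu+\Delta}}{dP_\mu} \right) \Big|_{\Delta=0} dP_\mu \, \Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= -\int_\Omega \frac{\partial}{\partial \Delta} \frac{dP_{\mu+\Delta}}{dP_\mu} \Big|_{\Delta=0} dP_\mu \, \Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= -\frac{\partial}{\partial \Delta} \int_\Omega \frac{dP_{\mu+\Delta}}{dP_\mu} \, dP_\mu \Big|_{\Delta=0} \Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= -\frac{\partial}{\partial \Delta} \int_\Omega dP_{\mu+\Delta} \Big|_{\Delta=0} \Delta + \frac{1}{2} I(\mu) \Delta^2 \\
&= \frac{1}{2} I(\mu) \Delta^2,
\end{align*}

where $I(\mu)$, introduced in the second line, is called the Fisher information
of the family $(P_\mu)_\mu$ at $\mu$. Note that if $\lambda$ is a common dominating measure for
$(P_{\mu+\Delta})$ for $\Delta$ small, $dP_{\mu+\Delta} = p_{\mu+\Delta} d\lambda$ and we can write

\[ I(\mu) = -\int \frac{\partial^2}{\partial \Delta^2} \log p_{\mu+\Delta} \Big|_{\Delta=0} p_\mu \, d\lambda, \]

which is the form that is usually given in elementary texts. The upshot of all
this is that $D(P_\mu, P_{\mu+\Delta})$ for $\Delta$ small is indeed quadratic in $\Delta$, with the scaling
provided by $I(\mu)$, and as a result the worst-case regret is always $\Omega(\sqrt{nk})$,
provided the class of distributions considered is sufficiently rich and not too
bizarre.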
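The quadratic approximation can be seen numerically for the Bernoulli family, whose Fisher information at mean $\mu$ is $1/(\mu(1-\mu))$ (a standard fact that is not derived in the text). The sketch below, with function names of our own, compares the exact relative entropy with $\frac{1}{2} I(\mu) \Delta^2$ for shrinking $\Delta$.

```python
import math


def kl_bernoulli(p, q):
    """Relative entropy between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def fisher_bernoulli(mu):
    """Fisher information of the Bernoulli family parameterised by its mean."""
    return 1.0 / (mu * (1.0 - mu))


def ratio(mu, delta):
    """Exact relative entropy divided by the quadratic approximation."""
    return kl_bernoulli(mu, mu + delta) / (0.5 * fisher_bernoulli(mu) * delta ** 2)


for delta in (0.1, 0.01, 0.001):
    print(delta, ratio(0.3, delta))
```

The ratio approaches one as $\Delta$ shrinks. For unit-variance Gaussians the approximation is exact, since $I(\mu) = 1$ and $D(P_\mu, P_{\mu+\Delta}) = \Delta^2/2$.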

2 We have now shown a lower bound that is $\Omega(\sqrt{nk})$, while many of the upper
bounds were $O(\log(n))$. There is no contradiction because the logarithmic
bounds depended on the inverse suboptimality gaps, which may be very large.

3 Our lower bound was only proven for $n \ge k - 1$. In Exercise 15.3, we ask you
to show that when $n < k - 1$, there exists a bandit such that

\[ R_n \ge \frac{n(2k - n - 1)}{2k} \ge \frac{n}{2}. \]

4 The method used to prove Theorem 15.2 can be viewed as a generalisation
and strengthening of Le Cam's method in statistics. Recall that Eq. (15.3)
establishes that for any $\mu$ and $\mu'$,

\[ \max\{ R_n(\pi, \nu_\mu), R_n(\pi, \nu_{\mu'}) \} \ge \frac{n\Delta}{8} \exp(-D(\mathbb{P}_\mu, \mathbb{P}_{\mu'})). \]


To explain Le Cam's method, we need a little notation. Let $\mathcal{X}$ be an outcome
space, $\mathcal{P}$ a set of measures on $\mathcal{X}$ and $\theta : \mathcal{P} \to \Theta$, where $(\Theta, d)$ is a metric space.
An estimator is a function $\hat\theta : \mathcal{X}^n \to \Theta$. Le Cam's method is used for proving
lower bounds on the minimax expected error $p_n^*$ of estimating $\theta(P)$
from i.i.d. data drawn from $P$ where estimation errors are measured using $d$
and

\[ p_n^* = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_{X_1, \ldots, X_n \sim P} \left[ d(\hat\theta(X_1, \ldots, X_n), \theta(P)) \right]. \tag{15.4} \]

The idea is to choose $P_0, P_1 \in \mathcal{P}$ to maximise $d(\theta(P_0), \theta(P_1)) \exp(-n D(P_0, P_1))$,
on the basis that for any $P_0, P_1 \in \mathcal{P}$,

\[ p_n^* \ge \frac{\Delta}{8} \exp(-n D(P_0, P_1)), \tag{15.5} \]

where $\Delta = d(\theta(P_0), \theta(P_1))$. There are two differences compared to a bandit
lower bound: in the bandit bounds, (i) we deal with the sequential setting, and
(ii) having chosen $P_0$ we choose $P_1$ in a way that depends on the algorithm.
This provides a much needed extra boost, without which the method would be
unable to capture how the characteristics of $\mathcal{P}$ are reflected in the minimax
risk (or regret, in our case).


Bibliographic Remarks 


The first work on lower bounds that we know of was the remarkably precise 
minimax analysis of two-armed Bernoulli bandits by Vogel [1960]. The
Bretagnolle-Huber inequality (Theorem 14.2) was first used for bandits by Bubeck et al.
[2013b]. As mentioned in the notes, the use of this inequality for proving lower 
bounds is known as Le Cam’s method in statistics [Le Cam, 1973]. The proof 
of Theorem 15.2 uses the same ideas as Gerchinovitz and Lattimore [2016], 
while the alternative proof in Exercise 15.2 is essentially due to Auer et al. 
[1995], who analysed the more difficult case where the rewards are Bernoulli (see 
Exercise 15.4). Yu [1997] describes some alternatives to Le Cam’s method for the 
passive, statistical setting. These alternatives can be (and often are) adapted to 
the sequential setting. 


Exercises 


15.1 (LE CAM’S METHOD) Establish the claim in Eq. (15.5). 


15.2 (ALTERNATIVE PROOF OF THEOREM 15.2) Here you will prove
Theorem 15.2 with a different method. Let $c > 0$ and $\Delta = 2c\sqrt{k/n}$, and
for each $i \in \{0, 1, \ldots, k\}$, let $\mu^{(i)} \in \mathbb{R}^k$ satisfy $\mu_j^{(i)} = \mathbb{I}\{i = j\} \Delta$. (Note that
$\mu^{(0)} = 0$.) Further abbreviate the notation in the proof of Theorem 15.2 by letting
$\mathbb{E}_i[\cdot] = \mathbb{E}_{\mu^{(i)}}[\cdot]$.

(a) Use Pinsker's inequality (Eq. 14.12) and Lemma 15.1 and the result of
Exercise 14.4 to show

\[ \mathbb{E}_i[T_i(n)] \le \mathbb{E}_0[T_i(n)] + n \sqrt{\frac{1}{4} \Delta^2 \mathbb{E}_0[T_i(n)]} = \mathbb{E}_0[T_i(n)] + c \sqrt{nk \, \mathbb{E}_0[T_i(n)]}. \]

(b) Using the previous part, Jensen's inequality and the identity $\sum_{i=1}^k \mathbb{E}_0[T_i(n)] = n$,
show that

\[ \sum_{i=1}^k \mathbb{E}_i[T_i(n)] \le n + c \sum_{i=1}^k \sqrt{nk \, \mathbb{E}_0[T_i(n)]} \le n + ckn. \]

(c) Let $R_i = R_n(\pi, \nu_{\mu^{(i)}})$. Find a choice of $c > 0$ for which

\begin{align*}
\sum_{i=1}^k R_i = \Delta \sum_{i=1}^k (n - \mathbb{E}_i[T_i(n)]) &\ge \Delta (nk - n - ckn) \\
&= 2c \sqrt{\frac{k}{n}} (nk - n - ckn) \ge \frac{nk}{8} \sqrt{\frac{k}{n}}.
\end{align*}

(d) Conclude that there exists an $i \in [k]$ such that

\[ R_i \ge \frac{1}{8} \sqrt{kn}. \]

The method used in this exercise is borrowed from Auer et al. [2002b] and
is closely related to the lower-bound technique known as Assouad's method
in statistics [Yu, 1997].


15.3 (LOWER BOUND FOR SMALL HORIZONS) Let $k > 1$ and $n < k$. Prove that
for any policy $\pi$ there exists a Gaussian bandit with unit variance and means
$\mu \in [0, 1]^k$ such that $R_n(\pi, \nu_\mu) \ge n(2k - n - 1)/(2k) \ge n/2$.


15.4 (LOWER BOUNDS FOR BERNOULLI BANDITS) Recall from Table 4.1 that
$\mathcal{E}_{\mathcal{B}}^k$ is the set of $k$-armed Bernoulli bandits. Show that there exists a universal
constant $c > 0$ such that for any $2 \le k \le n$, it holds that:

\[ R_n^*(\mathcal{E}_{\mathcal{B}}^k) = \inf_\pi \sup_{\nu \in \mathcal{E}_{\mathcal{B}}^k} R_n(\pi, \nu) \ge c \sqrt{nk}. \]

HINT Use the fact that KL divergence is upper bounded by the $\chi$-squared
distance (Eq. (14.16)).


15.5 In Chapter 9 we proved that if $\pi$ is the MOSS policy and $\nu \in \mathcal{E}_{\text{SG}}^k(1)$, then

\[ R_n(\pi, \nu) \le C \left( \sqrt{kn} + \sum_{i : \Delta_i > 0} \Delta_i \right), \]

where $C > 0$ is a universal constant. Prove that the dependence on the sum
cannot be eliminated.


HINT You will have to use that $T_i(t)$ is an integer for all $t$.


15.6 (LOWER BOUND FOR EXPLORE-THEN-COMMIT) Let $\text{ETC}_{nm}$ be the
explore-then-commit policy with inputs $n$ and $m$ respectively (Algorithm 1). Prove that
for all $m$, there exists a $\mu \in [0, 1]^k$ such that

\[ R_n(\text{ETC}_{nm}, \nu_\mu) \ge c \min\left\{ n, \, n^{2/3} k^{1/3} \right\}, \]

where $c > 0$ is a universal constant.


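15.7 (STOPPING-TIME VERSION OF DIVERGENCE DECOMPOSITION) Consider
the setting of Lemma 15.1, and let $\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$ and $\tau$ be an
$(\mathcal{F}_t)$-measurable stopping time. Then, for any random element $X$ that is
$\mathcal{F}_\tau$-measurable,

\[ D(\mathbb{P}_{\nu X}, \mathbb{P}_{\nu' X}) \le \sum_{i=1}^k \mathbb{E}_\nu[T_i(\tau)] \, D(P_i, P_i'), \]

where $\mathbb{P}_{\nu X}$ and $\mathbb{P}_{\nu' X}$ are the laws of $X$ under $\nu$ and $\nu'$ respectively.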


HINT Use Exercise 14.10 and Exercise 14.9. 


15.8 (DIVERGENCE DECOMPOSITION FOR MORE GENERAL ACTION SPACES) The
purpose of this exercise is to show that the divergence decomposition lemma
(Lemma 15.1) continues to hold for more general action spaces $(\mathcal{A}, \mathcal{G})$. Starting
from the set-up of Section 4.7, let $\mathbb{P}_\nu = \mathbb{P}_{\nu\pi}$ and $\mathbb{P}_{\nu'} = \mathbb{P}_{\nu'\pi}$ be the measures on the
canonical bandit model induced by the interconnection of $\pi$ and $\nu$ (respectively,
$\pi$ and $\nu'$).


(a) Prove that

\[ D(\mathbb{P}_\nu, \mathbb{P}_{\nu'}) = \int_{\mathcal{A}} D(P_a, P_a') \, dG_\nu(a), \tag{15.6} \]

where $G_\nu$ is a measure on $(\mathcal{A}, \mathcal{G})$ defined by $G_\nu(B) = \mathbb{E}_\nu \left[ \sum_{t=1}^n \mathbb{I}\{A_t \in B\} \right]$.

(b) Prove that

\[ D(\mathbb{P}_\nu, \mathbb{P}_{\nu'}) = \mathbb{E}_\nu \left[ \sum_{t=1}^n D(P_{A_t}, P_{A_t}') \right]. \]

HINT Use an appropriately adjusted form of the chain rule for relative entropy
from Exercise 14.12.


Instance-Dependent Lower Bounds


In the last chapter, we proved a lower bound on the minimax regret for subgaussian 
bandits with suboptimality gaps in [0,1]. Such bounds serve as a useful measure 
of the robustness of a policy, but are often excessively conservative. This chapter 
is devoted to understanding instance-dependent lower bounds, which try to 
capture the optimal performance of a policy on a specific bandit instance. 

Because the regret is a multi-objective criterion, an algorithm designer might
try to design algorithms that perform well on one kind of instance or another.
An extreme example is the policy that chooses $A_t = 1$ for all $t$, which suffers
zero regret when the first arm is optimal and linear regret otherwise. This is a 
harsh trade-off, with the price for reducing the regret from logarithmic to zero 
on just a few instances being linear regret on the remainder. Surprisingly, this 
is the nature of the game in bandits. One can assign a measure of difficulty to 
each instance such that policies performing overly well relative to this measure 
on some instances pay a steep price on others. The situation is illustrated in 
Fig. 16.1. 


[Figure 16.1 plot: regret (vertical axis) against instances (horizontal axis), with curves labelled ‘over-specialised’ and ‘reasonable, not instance optimal’, and lines marking the minimax optimality limit and the instance optimality limit.]

Figure 16.1 On the x-axis, the instances are ordered according to the measure of difficulty,
and the y-axis shows the regret (on some scale). In the previous chapter, we proved that
no policy can be entirely below the horizontal ‘minimax optimal’ line. Theorem 16.4 in
this chapter shows that if the regret of a policy is below the ‘instance optimal’ line at
any point, then it must have regret above the shaded region for other instances. An
example is the ‘over-specialised’ policy.

In finite time, the situation is a little messy, but if one pushes these ideas to
the limit, then for many classes of bandits one can define a precise notion of
instance-dependent optimality.


Asymptotic Bounds 


We need to define exactly what is meant by a reasonable policy. If one is only 
concerned with asymptotics, then a rather conservative definition suffices. 


DEFINITION 16.1. A policy $\pi$ is called consistent over a class of bandits $\mathcal{E}$ if for
all $\nu \in \mathcal{E}$ and $p > 0$, it holds that

\[ \lim_{n \to \infty} \frac{R_n(\pi, \nu)}{n^p} = 0. \tag{16.1} \]

The class of consistent policies over $\mathcal{E}$ is denoted by $\Pi_{\text{cons}}(\mathcal{E})$.


Theorem 7.1 shows that UCB is consistent over $\mathcal{E}_{\text{SG}}^k(1)$. The strategy that
always chooses the first action is not consistent on any class $\mathcal{E}$ unless the first
arm is optimal for every $\nu \in \mathcal{E}$.


Consistency is an asymptotic notion. A policy could be consistent and yet
play $A_t = 1$ for all $t \le 10^{100}$. For this reason, an assumption of consistency
is insufficient to derive non-asymptotic lower bounds. In Section 16.2, we 
introduce a finite-time version of consistency that allows us to prove finite- 
time instance-dependent lower bounds. 


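Recall that a class $\mathcal{E}$ of stochastic bandits is unstructured if $\mathcal{E} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_k$
with $\mathcal{M}_1, \ldots, \mathcal{M}_k$ sets of distributions. The main theorem of this chapter is a
generic lower bound that applies to any unstructured class of stochastic bandits.
After the proof, we will see some applications to specific classes. Let $\mathcal{M}$ be a set
of distributions with finite means, and let $\mu : \mathcal{M} \to \mathbb{R}$ be the function that maps
$P \in \mathcal{M}$ to its mean. Let $\mu^* \in \mathbb{R}$ and $P \in \mathcal{M}$ have $\mu(P) < \mu^*$ and define

\[ d_{\inf}(P, \mu^*, \mathcal{M}) = \inf_{P' \in \mathcal{M}} \left\{ D(P, P') : \mu(P') > \mu^* \right\}. \]

THEOREM 16.2. Let $\mathcal{E} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_k$ and $\pi \in \Pi_{\text{cons}}(\mathcal{E})$ be a consistent policy
over $\mathcal{E}$. Then, for all $\nu = (P_i)_{i=1}^k \in \mathcal{E}$, it holds that

\[ \liminf_{n \to \infty} \frac{R_n}{\log(n)} \ge c^*(\nu, \mathcal{E}) = \sum_{i : \Delta_i > 0} \frac{\Delta_i}{d_{\inf}(P_i, \mu^*, \mathcal{M}_i)}, \tag{16.2} \]

where $\Delta_i$ is the suboptimality gap of the $i$th arm in $\nu$ and $\mu^*$ is the mean of the
optimal arm.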

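Proof Let $\mu_i$ be the mean of the $i$th arm in $\nu$ and $d_i = d_{\inf}(P_i, \mu^*, \mathcal{M}_i)$. The
result will follow from Lemma 4.5, and by showing that for any suboptimal arm
$i$ it holds that

\[ \liminf_{n \to \infty} \frac{\mathbb{E}_{\nu\pi}[T_i(n)]}{\log(n)} \ge \frac{1}{d_i}. \]

Fix a suboptimal arm $i$, and let $\varepsilon > 0$ be arbitrary and $\nu' = (P_j')_{j=1}^k \in \mathcal{E}$ be a
bandit with $P_j' = P_j$ for $j \ne i$ and $P_i' \in \mathcal{M}_i$ be such that $D(P_i, P_i') \le d_i + \varepsilon$ and
$\mu(P_i') > \mu^*$, which exists by the definition of $d_i$. Let $\mu' \in \mathbb{R}^k$ be the vector of means
of distributions of $\nu'$. By Lemma 15.1, we have $D(\mathbb{P}_{\nu\pi}, \mathbb{P}_{\nu'\pi}) \le \mathbb{E}_{\nu\pi}[T_i(n)](d_i + \varepsilon)$,
and by Theorem 14.2, for any event $A$,

\[ \mathbb{P}_{\nu\pi}(A) + \mathbb{P}_{\nu'\pi}(A^c) \ge \frac{1}{2} \exp(-D(\mathbb{P}_{\nu\pi}, \mathbb{P}_{\nu'\pi})) \ge \frac{1}{2} \exp(-\mathbb{E}_{\nu\pi}[T_i(n)](d_i + \varepsilon)). \]

Now choose $A = \{T_i(n) > n/2\}$, and let $R_n = R_n(\pi, \nu)$ and $R_n' = R_n(\pi, \nu')$.
Then,

\begin{align*}
R_n + R_n' &\ge \frac{n}{2} \left( \mathbb{P}_{\nu\pi}(A) \Delta_i + \mathbb{P}_{\nu'\pi}(A^c)(\mu_i' - \mu^*) \right) \\
&\ge \frac{n}{2} \min\{\Delta_i, \mu_i' - \mu^*\} \left( \mathbb{P}_{\nu\pi}(A) + \mathbb{P}_{\nu'\pi}(A^c) \right) \\
&\ge \frac{n}{4} \min\{\Delta_i, \mu_i' - \mu^*\} \exp(-\mathbb{E}_{\nu\pi}[T_i(n)](d_i + \varepsilon)).
\end{align*}

Rearranging and taking the limit inferior leads to

\begin{align*}
\liminf_{n \to \infty} \frac{\mathbb{E}_{\nu\pi}[T_i(n)]}{\log(n)}
&\ge \frac{1}{d_i + \varepsilon} \liminf_{n \to \infty} \frac{\log\left( \frac{n \min\{\Delta_i, \mu_i' - \mu^*\}}{4(R_n + R_n')} \right)}{\log(n)} \\
&= \frac{1}{d_i + \varepsilon} \left( 1 - \limsup_{n \to \infty} \frac{\log(R_n + R_n')}{\log(n)} \right) = \frac{1}{d_i + \varepsilon},
\end{align*}

where the last equality follows from the definition of consistency, which says
that for any $p > 0$, there exists a constant $C_p$ such that for sufficiently large $n$,
$R_n + R_n' \le C_p n^p$, which implies that

\[ \limsup_{n \to \infty} \frac{\log(R_n + R_n')}{\log(n)} \le \limsup_{n \to \infty} \frac{p \log(n) + \log(C_p)}{\log(n)} = p, \]

which gives the result since $p > 0$ was arbitrary and by taking the limit as $\varepsilon$
tends to zero.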


Table 16.1 provides explicit formulas for $d_{\inf}(P, \mu^*, \mathcal{M})$ for common choices of
$\mathcal{M}$. The calculations of these quantities are all straightforward (Exercise 16.1).
The lower bound and the quantity $c^*(\nu, \mathcal{E})$ are quite fundamental in
the sense that for most classes $\mathcal{E}$, there exists a policy $\pi$ for which

\[ \lim_{n \to \infty} \frac{R_n(\pi, \nu)}{\log(n)} = c^*(\nu, \mathcal{E}) \quad \text{for all } \nu \in \mathcal{E}. \tag{16.3} \]

This justifies calling a policy asymptotically optimal on class $\mathcal{E}$ if Eq. (16.3)
holds. For example, UCB from Chapter 8 and KL-UCB from Chapter 10 are
asymptotically optimal for $\mathcal{E}_{\mathcal{N}}^k(1)$ and $\mathcal{E}_{\mathcal{B}}^k$, respectively.


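$\mathcal{M}$ | $P$ | $d_{\inf}(P, \mu^*, \mathcal{M})$
$\{\mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R}\}$ | $\mathcal{N}(\mu, \sigma^2)$ | $\dfrac{(\mu - \mu^*)^2}{2\sigma^2}$
$\{\mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 \in (0, \infty)\}$ | $\mathcal{N}(\mu, \sigma^2)$ | $\dfrac{1}{2} \log\left( 1 + \dfrac{(\mu - \mu^*)^2}{\sigma^2} \right)$
$\{\mathcal{B}(\mu) : \mu \in [0, 1]\}$ | $\mathcal{B}(\mu)$ | $\mu \log\left( \dfrac{\mu}{\mu^*} \right) + (1 - \mu) \log\left( \dfrac{1 - \mu}{1 - \mu^*} \right)$
$\{\mathcal{U}(a, b) : a, b \in \mathbb{R}\}$ | $\mathcal{U}(a, b)$ | $\log\left( 1 + \dfrac{2(\mu^* - (a + b)/2)}{b - a} \right)$

Table 16.1 Expressions for $d_{\inf}$ for different parametric families when the mean of $P$ is less
than $\mu^*$.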
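To get a feel for the constant $c^*(\nu, \mathcal{E})$ in Theorem 16.2, the first and third rows of the table can be evaluated directly. The sketch below uses function names of our own; the Bernoulli formula assumes $0 < \mu < \mu^* < 1$ so that the logarithms are finite.

```python
import math


def d_inf_gaussian(mu, mu_star, sigma2=1.0):
    """d_inf for Gaussians with known variance sigma2."""
    return (mu - mu_star) ** 2 / (2.0 * sigma2)


def d_inf_bernoulli(mu, mu_star):
    """d_inf for Bernoulli distributions, assuming 0 < mu < mu_star < 1."""
    return (mu * math.log(mu / mu_star)
            + (1 - mu) * math.log((1 - mu) / (1 - mu_star)))


def c_star(means, d_inf):
    """The constant in Eq. (16.2): sum of gap / d_inf over suboptimal arms."""
    mu_star = max(means)
    return sum((mu_star - m) / d_inf(m, mu_star) for m in means if m < mu_star)


gaussian_constant = c_star([1.0, 0.5, 0.0], d_inf_gaussian)
bernoulli_constant = c_star([0.9, 0.8, 0.5], d_inf_bernoulli)
print(gaussian_constant, bernoulli_constant)
```

For unit-variance Gaussians the constant reduces to $\sum_{i : \Delta_i > 0} 2/\Delta_i$. For Bernoulli arms, Pinsker's inequality gives $d_{\inf} \ge 2\Delta_i^2$, so the constant is at most $\sum_{i : \Delta_i > 0} 1/(2\Delta_i)$.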


Finite-Time Bounds 


By making a finite-time analogue of consistency, it is possible to prove a finite- 
time instance-dependent bound. First, a lemma that summarises what can be 
obtained by chaining the Bretagnolle-Huber inequality (Theorem 14.2) with the 
divergence decomposition lemma (Lemma 15.1). 


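LEMMA 16.3. Let $\nu = (P_i)$ and $\nu' = (P_i')$ be $k$-armed stochastic bandits that
differ only in the distribution of the reward for action $i \in [k]$. Assume that $i$ is
suboptimal in $\nu$ and uniquely optimal in $\nu'$. Let $\lambda = \mu_i(\nu') - \mu_i(\nu)$. Then, for
any policy $\pi$,

\[ \mathbb{E}_{\nu\pi}[T_i(n)] \ge \frac{\log\left( \frac{\min\{\lambda - \Delta_i(\nu), \Delta_i(\nu)\}}{4} \right) + \log(n) - \log(R_n(\nu) + R_n(\nu'))}{D(P_i, P_i')}. \tag{16.4} \]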

The lemma holds for finite $n$ and any $\nu$ and can be used to derive finite-time
instance-dependent lower bounds for any environment class $\mathcal{E}$ that is rich
enough. The following result provides a finite-time instance-dependent bound
for Gaussian bandits where the asymptotic notion of consistency is replaced by
an assumption that the minimax regret is not too large. This assumption alone 
is enough to show that no policy that is remotely close to minimax optimal can 
be much better than UCB on any instance. 


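THEOREM 16.4. Let $\nu \in \mathcal{E}_{\mathcal{N}}^k$ be a $k$-armed Gaussian bandit with mean vector
$\mu \in \mathbb{R}^k$ and suboptimality gaps $\Delta \in [0, \infty)^k$. Let $N$ be a nonempty subset of
natural numbers and

\[ \mathcal{E}(\nu) = \left\{ \nu' \in \mathcal{E}_{\mathcal{N}}^k : \mu_i(\nu') \in [\mu_i, \mu_i + 2\Delta_i], \ 1 \le i \le k \right\}. \]

Suppose $C > 0$ and $p \in (0, 1)$ are constants and $\pi$ is a policy such that
$R_n(\pi, \nu') \le C n^p$ for all $n \in N$ and $\nu' \in \mathcal{E}(\nu)$. Then, for any $\varepsilon \in (0, 1]$ and
$n \in N$,

\[ R_n(\pi, \nu) \ge \frac{2}{(1 + \varepsilon)^2} \sum_{i : \Delta_i > 0} \frac{\left( (1 - p) \log(n) + \log\left( \frac{\varepsilon \Delta_i}{8C} \right) \right)^+}{\Delta_i}. \tag{16.5} \]

Proof Fix $n \in N$. Let $i$ be suboptimal in $\nu$, and choose $\nu' \in \mathcal{E}(\nu)$ such that
$\mu_j(\nu') = \mu_j(\nu)$ for $j \ne i$ and $\mu_i(\nu') = \mu_i + \Delta_i(1 + \varepsilon)$. Then, by Lemma 16.3 with
$\lambda = \Delta_i(1 + \varepsilon)$,

\begin{align*}
\mathbb{E}_{\nu\pi}[T_i(n)] &\ge \frac{2}{\Delta_i^2 (1 + \varepsilon)^2} \left( \log\left( \frac{n}{2Cn^p} \right) + \log\left( \frac{\min\{\lambda - \Delta_i, \Delta_i\}}{4} \right) \right) \\
&= \frac{2}{\Delta_i^2 (1 + \varepsilon)^2} \left( (1 - p) \log(n) + \log\left( \frac{\varepsilon \Delta_i}{8C} \right) \right).
\end{align*}

Plugging this into the basic regret decomposition identity (Lemma 4.5) gives the
result.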


When p = 1/2, the leading term in this lower bound is approximately half that 
of the asymptotic bound. This effect may be real. The class of policies considered 
is larger than in the asymptotic lower bound, and so there is the possibility that 
the policy that is best tuned for a given environment achieves a smaller regret. 
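To get a feel for the size of the lower-order term, the sketch below evaluates the right-hand side of Eq. (16.5) and compares it with the leading term Σ_i 2 log(n)/Δ_i of the asymptotic bound for unit-variance Gaussian bandits. The function names and the example values of the gaps, C, p and ε are illustrative assumptions on our part.

```python
import math

def finite_time_lower_bound(gaps, n, C, p, eps):
    """Right-hand side of Eq. (16.5) for a unit-variance Gaussian bandit.

    gaps: suboptimality gaps (zeros mark optimal arms and are skipped)
    n: horizon; C, p: constants in the assumption R_n <= C * n**p
    eps: free parameter in (0, 1]
    """
    total = 0.0
    for gap in gaps:
        if gap > 0:
            inner = (1 - p) * math.log(n) + math.log(eps * gap / (8 * C))
            total += max(inner, 0.0) / gap  # (.)^+ clips negative values to zero
    return 2.0 / (1 + eps) ** 2 * total

def asymptotic_leading_term(gaps, n):
    """Leading term sum_i 2 log(n) / Delta_i of the asymptotic Gaussian bound."""
    return sum(2 * math.log(n) / g for g in gaps if g > 0)

gaps = [0.0, 0.1, 0.5]
n = 10**12
ratio = finite_time_lower_bound(gaps, n, C=1.0, p=0.5, eps=0.01) / asymptotic_leading_term(gaps, n)
```

Even at this very large horizon the ratio is well below the limiting value of roughly one half, because the term log(εΔ_i/(8C)) is strongly negative for small ε and small gaps. For short horizons the positive part can clip the whole bound to zero.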


Notes 


1 We mentioned that for most classes ℰ there is a policy satisfying Eq. (16.3).
Its form is derived from the lower bound, and by making some additional 
assumptions on the underlying distributions. For details, see the article 
by Burnetas and Katehakis [1996], which is also the original source of 
Theorem 16.2. 

2 The analysis in this chapter only works for unstructured classes. Without this
assumption a policy can potentially learn about the reward from one arm 
by playing other arms and this greatly reduces the regret. Lower bounds for 
structured bandits are more delicate and will be covered on a case-by-case 
basis in subsequent chapters. 




3 The classes analysed in Table 16.1 are all parametric, which makes the
calculation possible analytically. There has been relatively little analysis
in the non-parametric case, but we know of three exceptions for which we
simply refer the reader to the appropriate source. The first is the class of
distributions with bounded support: M = {P : Supp(P) ⊂ [0,1]}, which has
been analysed exactly [Honda and Takemura, 2010]. The second is the class
of distributions with semi-bounded support, M = {P : Supp(P) ⊂ (−∞, 1]}
[Honda and Takemura, 2015]. The third is the class of distributions with
bounded kurtosis, M = {P : Kurt_{X∼P}[X] ≤ κ} [Lattimore, 2017].

a factor of two on the standard Gaussian bandit problems as compared to the
optimal asymptotic regret.


Exercises 


16.1 (RELATIVE ENTROPY CALCULATIONS) Verify the calculations in Table 16.1. 


16.2 (RADEMACHER NOISE) Let R(μ) be the shifted Rademacher distribution,
which for μ ∈ R and X ∼ R(μ) is characterised by P(X = μ + 1) =
P(X = μ − 1) = 1/2.


(a) Show that d_inf(R(μ), μ*, M) = ∞ for any μ < μ*.
(b) Design a policy π for bandits with shifted Rademacher rewards such that
the regret is bounded by

\[
R_n(\pi, \nu) \le 3 \sum_{i=1}^k \Delta_i \quad \text{for all } n \text{ and } \nu \in \mathcal{M}^k .
\]


(c) The results from parts (a) and (b) seem to contradict the heuristic analysis
in Note 1 at the end of Chapter 15. Explain.


16.3 (ASYMPTOTIC LOWER BOUND FOR EXPONENTIAL FAMILIES) Let M = {P_θ :
θ ∈ Θ} be an exponential family with sufficient statistic equal to the identity and
ℰ = M^k and π be a consistent policy for ℰ. Prove that the asymptotic upper
bound on the regret proven in Exercise 10.4 is tight.


16.4 (UNKNOWN SUBGAUSSIAN CONSTANT) Let

\[
\mathcal{M} = \{P : \text{there exists a } \sigma^2 \ge 0 \text{ such that } P \text{ is } \sigma^2\text{-subgaussian}\} .
\]

(a) Find a distribution P such that P ∉ M.
(b) Suppose that P ∈ M has mean μ ∈ R. Prove that d_inf(P, μ*, M) = 0 for all
μ* > μ.
(c) Let ℰ = {(P_i) : P_i ∈ M for all 1 ≤ i ≤ k}. Prove that if k > 1, then for all
consistent policies π,

\[
\liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} = \infty \quad \text{for all } \nu \in \mathcal{E} .
\]

(d) Let f : N → [0,∞) be any increasing function with lim_{n→∞} f(n)/log(n) =
∞. Prove there exists a policy π such that

\[
\limsup_{n\to\infty} \frac{R_n(\pi,\nu)}{f(n)} = 0 \quad \text{for all } \nu \in \mathcal{E},
\]

where ℰ is as in the previous part.
(e) Conclude that there exists a consistent policy for ℰ.


16.5 (MINIMAX LOWER BOUND) Use Lemma 16.3 to prove Theorem 15.2, possibly 
with different constants. 


16.6 (REFINING THE LOWER-ORDER TERMS) Let k = 2, and for ν ∈ ℰ^k_𝒩, let
Δ(ν) = max{Δ_1(ν), Δ_2(ν)}. Suppose that π is a policy such that for all ν ∈ ℰ^k_𝒩
with Δ(ν) ≤ 1, it holds that

\[
R_n(\pi,\nu) \le \frac{C\log(n)}{\Delta(\nu)} . \tag{16.6}
\]

(a) Give an example of a policy satisfying Eq. (16.6).
(b) Assume that i = 2 is suboptimal for ν and that α ∈ (0,1) is such that
E_{νπ}[T_2(n)] = (1/(2Δ(ν)²)) log(α). Let ν′ be the alternative environment where
μ_1(ν′) = μ_1(ν) and μ_2(ν′) = μ_2(ν) + 2Δ(ν). Show that

\[
\exp\left(-D(P_{\nu\pi}, P_{\nu'\pi})\right) = \frac{1}{\alpha} .
\]

(c) Let A be the event that T_2(n) ≥ n/2. Show that

\[
P_{\nu\pi}(A) \le \frac{2C\log(n)}{n\Delta(\nu)^2} \quad \text{and} \quad P_{\nu'\pi}(A^c) \le \frac{2C\log(n)}{n\Delta(\nu)^2} .
\]

(d) Show that

\[
\frac{1}{2\alpha} \le P_{\nu\pi}(A) + P_{\nu'\pi}(A^c) \le \frac{4C\log(n)}{n\Delta(\nu)^2} .
\]

(e) Show that α ≥ nΔ(ν)²/(8C log(n)) and conclude that

\[
R_n(\pi,\nu) \ge \frac{1}{2\Delta(\nu)} \log\left(\frac{n\Delta(\nu)^2}{8C\log(n)}\right) .
\]

(f) Generalise the argument to an arbitrary number of arms.


In Exercise 7.6 you showed that there exists a bandit policy π such that for
some universal constant C > 0 and for any ν ∈ ℰ^k_{[0,b]}, a k-armed bandit with
rewards taking values in [0, b], the regret R_n(π, ν) of π on ν after n rounds
satisfies

\[
R_n(\pi,\nu) \le C \sum_{i : \Delta_i > 0} \left(\Delta_i + \left(b + \frac{\sigma_i^2}{\Delta_i}\right)\log(n)\right),
\]

where Δ_i = Δ_i(ν) is the action gap of action i and σ_i² = σ_i²(ν) is the
variance of the reward of arm i. In particular, this is the inequality shown in
Eq. (7.14). The next exercise asks you to show that the appearance of both
b and σ_i² is necessary in this bound.


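16.7 (SHARPNESS OF EQ. (7.14)) Let k > 1, b > 0 and c > 0 be arbitrary. Show
that there is no policy π for which either

\[
\limsup_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \le c \sum_{i : \Delta_i(\nu) > 0} b, \qquad \forall \nu \in \mathcal{E}^k_{[0,b]} \tag{16.7}
\]

or

\[
\limsup_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \le c \sum_{i : \Delta_i(\nu) > 0} \frac{\sigma_i^2(\nu)}{\Delta_i(\nu)}, \qquad \forall \nu \in \mathcal{E}^k_{[0,b]} \tag{16.8}
\]

would hold true.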


The intuition underlying this result is the following: Eq. (16.7) cannot hold
because this would mean that for some policy, the regret is logarithmic
with a constant independent of the gaps, while intuitively, if the variance
is constant, the coefficient of the logarithmic regret must increase as the
gaps get close. Similarly, Eq. (16.8) cannot hold either because we expect a
logarithmic regret with a coefficient proportional to the inverse gap even as
the variance goes to zero, as the case of Bernoulli bandits shows. This exercise
is due to Audibert et al. [2007].


16.8 (LOWER BOUND ON REGRET VARIANCE) Let k > 1 and ℰ ⊂ ℰ^k_𝒩 be the set
of k-armed Gaussian bandits with mean rewards in [0,1] for all arms. Suppose
that π is a policy such that for all ν ∈ ℰ,


Prove that 


im supsup CURRIN) 5 
n—00 VEE (1 — p) log(n) 


where R̂_n(π, ν) = nμ*(ν) − Σ_{t=1}^n μ_{A_t}(ν).


17 


High-Probability Lower Bounds 


The lower bounds proven in the last two chapters were for stochastic bandits.
In this chapter, we prove high-probability lower bounds for both stochastic and
adversarial bandits. Recall that for adversarial bandit x ∈ [0,1]^{n×k}, the random
regret is

\[
\hat R_n = \max_{i \in [k]} \sum_{t=1}^n (x_{ti} - x_{tA_t}),
\]

and the (expected) regret is R_n = E[R̂_n]. To set expectations, remember that in
Chapter 12 we proved two high-probability upper bounds on the regret of Exp3-
IX. In the first, we showed there exists a policy π such that for all adversarial
bandits x ∈ [0,1]^{n×k} and δ ∈ (0,1), it holds with probability at least 1 − δ that

\[
\hat R_n = O\left(\sqrt{kn\log(k)} + \sqrt{\frac{kn}{\log(k)}}\,\log\left(\frac{1}{\delta}\right)\right) . \tag{17.1}
\]

We also gave a version of the algorithm that depended on δ ∈ (0,1) for which
with probability at least 1 − δ,

\[
\hat R_n = O\left(\sqrt{kn\log\left(\frac{k}{\delta}\right)}\right) . \tag{17.2}
\]


The important difference is the order of quantifiers. In the first, we have a
single algorithm and a high-probability guarantee that holds simultaneously for
any confidence level. The second algorithm needs the confidence level to be
specified in advance. The price for using the generic algorithm appears to be
√(log(1/δ)/log(k)), which is usually quite small but not totally insignificant. We
will see that both bounds are tight up to constant factors, which implies that
knowing the desired confidence level in advance really does help. One reason
why choosing the confidence level in advance is not ideal is that the resulting
high-probability bound cannot be integrated to prove a bound in expectation.
For algorithms satisfying (17.1), the expected regret can be bounded by

\[
R_n \le \int_0^\infty P\left(\hat R_n \ge u\right) du = O\left(\sqrt{kn\log(k)}\right) . \tag{17.3}
\]

On the other hand, if the high-probability bound only holds for a single δ, as in
(17.2), then it seems hard to do much better than

\[
R_n \le n\delta + O\left(\sqrt{kn\log\left(\frac{k}{\delta}\right)}\right),
\]

which with the best choice of δ leads to a bound of O(√(kn log(n))).
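To make the last step explicit, here is a short worked calculation. It assumes the single-δ bound has the form of Eq. (17.2) with a hidden constant c (our notation), that k ≤ n, and that we pick δ = 1/n, which is one natural choice rather than a claim about the exact optimiser.

```latex
\begin{align*}
R_n &\le n\delta + c\sqrt{kn\log(k/\delta)} \\
    &= 1 + c\sqrt{kn\log(kn)} && (\delta = 1/n) \\
    &\le 1 + c\sqrt{2kn\log(n)} && (k \le n) \\
    &= O\big(\sqrt{kn\log(n)}\big).
\end{align*}
```

A much larger δ lets the term nδ dominate, while a much smaller δ only inflates the logarithm, so δ of order 1/n balances the two terms up to constants.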


17.1 Stochastic Bandits


For simplicity, we start with the stochastic setting before explaining how to
convert the arguments to the adversarial model. There is no randomness in the
expected regret, so in order to derive a high-probability bound, we define the
random pseudo-regret by

\[
\bar R_n = \sum_{i=1}^k T_i(n)\Delta_i,
\]

which is a random variable through the pull counts T_i(n).


For all results in this section, we let ℰ^k ⊂ ℰ^k_𝒩 denote the set of k-armed
Gaussian bandits with suboptimality gaps bounded by one. For μ ∈ [0,1]^k
we let ν_μ ∈ ℰ^k be the Gaussian bandit with means μ.


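THEOREM 17.1. Let n ≥ 1 and k ≥ 2 and B > 0 and π be a policy such that for
any ν ∈ ℰ^k,

\[
R_n(\pi,\nu) \le B\sqrt{(k-1)n} . \tag{17.4}
\]

Let δ ∈ (0,1). Then there exists a bandit ν in ℰ^k such that

\[
P\left(\bar R_n(\pi,\nu) \ge \frac{1}{4}\min\left\{n,\ \frac{1}{B}\sqrt{(k-1)n}\,\log\left(\frac{1}{4\delta}\right)\right\}\right) \ge \delta .
\]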


Proof Let Δ ∈ (0,1/2] be a constant to be tuned subsequently and ν = ν_μ where
the mean vector μ ∈ R^k is defined by μ_1 = Δ and μ_i = 0 for i > 1. Abbreviate
R_n = R_n(π, ν) and P = P_{νπ} and E = E_{νπ}. Let i = argmin_{i>1} E[T_i(n)]. Then, by
Lemma 4.5 and the assumption in Eq. (17.4),

\[
\mathbb{E}[T_i(n)] \le \frac{R_n}{\Delta(k-1)} \le \frac{B}{\Delta}\sqrt{\frac{n}{k-1}} . \tag{17.5}
\]

Define alternative bandit ν′ = ν_{μ′} where μ′ ∈ R^k is equal to μ except μ′_i = μ_i + 2Δ.
Abbreviate P′ = P_{ν′π} and R̄_n = R̄_n(π, ν) and R̄′_n = R̄_n(π, ν′). By Lemma 4.5, the
Bretagnolle-Huber inequality (Theorem 14.2) and the divergence decomposition
(Lemma 15.1), we have

\[
\begin{aligned}
P\left(\bar R_n \ge \frac{\Delta n}{2}\right) + P'\left(\bar R_n' \ge \frac{\Delta n}{2}\right)
&\ge P\left(T_i(n) \ge \frac{n}{2}\right) + P'\left(T_i(n) < \frac{n}{2}\right) \\
&\ge \frac{1}{2}\exp\left(-D(P, P')\right)
\ge \frac{1}{2}\exp\left(-2B\Delta\sqrt{\frac{n}{k-1}}\right) \ge 2\delta,
\end{aligned}
\]

where the last line follows by choosing

\[
\Delta = \min\left\{\frac{1}{2},\ \frac{1}{2B}\sqrt{\frac{k-1}{n}}\,\log\left(\frac{1}{4\delta}\right)\right\} .
\]

The result follows since max{a, b} ≥ (a + b)/2.
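The tuning of Δ in the proof can be checked numerically. The sketch below computes the chosen gap, the resulting Bretagnolle-Huber floor and the regret threshold Δn/2. The function names are ours, and we assume δ < 1/4 so that log(1/(4δ)) is positive.

```python
import math

def tuned_gap(n, k, B, delta):
    """Gap Delta = min{1/2, sqrt((k-1)/n) * log(1/(4 delta)) / (2B)} from the proof."""
    return min(0.5, math.sqrt((k - 1) / n) * math.log(1 / (4 * delta)) / (2 * B))

def bretagnolle_huber_floor(n, k, B, delta):
    """Lower bound (1/2) * exp(-2 * B * Delta * sqrt(n / (k - 1))) on the sum of the two probabilities."""
    gap = tuned_gap(n, k, B, delta)
    return 0.5 * math.exp(-2 * B * gap * math.sqrt(n / (k - 1)))

def regret_threshold(n, k, B, delta):
    """Threshold Delta * n / 2, equal to (1/4) min{n, sqrt((k-1) n) * log(1/(4 delta)) / B}."""
    return tuned_gap(n, k, B, delta) * n / 2

floor_uncapped = bretagnolle_huber_floor(n=10_000, k=5, B=2.0, delta=0.01)
floor_capped = bretagnolle_huber_floor(n=10, k=5, B=0.1, delta=0.01)
```

When the minimum is attained by the second argument the floor equals 2δ exactly. When Δ is capped at 1/2 the exponent is smaller in magnitude, so the floor exceeds 2δ.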


COROLLARY 17.2. Let n ≥ 1 and k ≥ 2. Then, for any policy π and δ ∈ (0,1)


such that 


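there exists a bandit problem ν ∈ ℰ^k such that

\[
P\left(\bar R_n(\pi,\nu) \ge \frac{1}{4}\min\left\{n,\ \sqrt{\frac{(k-1)n}{2}\log\left(\frac{1}{4\delta}\right)}\right\}\right) \ge \delta . \tag{17.7}
\]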


Proof We prove the result by contradiction. Assume that the conclusion does
not hold for π and let δ ∈ (0,1) satisfy (17.6). Then, for any bandit problem
ν ∈ ℰ^k, the expected regret of π is bounded by

\[
R_n(\pi,\nu) \le n\delta + \frac{1}{4}\sqrt{\frac{(k-1)n}{2}\log\left(\frac{1}{4\delta}\right)} \le \sqrt{2(k-1)n\log\left(\frac{1}{4\delta}\right)} .
\]

Therefore, π satisfies the conditions of Theorem 17.1 with B = √(2 log(1/(4δ))),
which implies that there exists some bandit problem ν ∈ ℰ^k such that (17.7)
holds, contradicting our assumption.


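COROLLARY 17.3. Let k ≥ 2 and p ∈ (0,1) and B > 0. Then, there does not
exist a policy π such that for all n ≥ 1, δ ∈ (0,1) and ν ∈ ℰ^k,

\[
P\left(\bar R_n(\pi,\nu) \ge B\sqrt{(k-1)n}\,\log^p\left(\frac{1}{\delta}\right)\right) < \delta .
\]

Proof We proceed by contradiction. Suppose that such a policy exists. Choosing
δ sufficiently small and n sufficiently large ensures that

\[
\frac{1}{4B}\log\left(\frac{1}{4\delta}\right) \ge B\log^p\left(\frac{1}{\delta}\right)
\quad \text{and} \quad
\frac{1}{B}\sqrt{(k-1)n}\,\log\left(\frac{1}{4\delta}\right) \le n .
\]

Now, by assumption, for any ν ∈ ℰ^k we have

\[
R_n(\pi,\nu) \le \int_0^\infty P\left(\bar R_n(\pi,\nu) \ge x\right) dx
\le B\sqrt{n(k-1)} \int_0^\infty \exp\left(-x^{1/p}\right) dx \le B\sqrt{n(k-1)} .
\]

Therefore, by Theorem 17.1, there exists a bandit ν ∈ ℰ^k such that

\[
P\left(\bar R_n(\pi,\nu) \ge B\sqrt{(k-1)n}\,\log^p\left(\frac{1}{\delta}\right)\right)
\ge P\left(\bar R_n(\pi,\nu) \ge \frac{1}{4}\min\left\{n,\ \frac{1}{B}\sqrt{(k-1)n}\,\log\left(\frac{1}{4\delta}\right)\right\}\right) \ge \delta,
\]

which contradicts our assumption and completes the proof.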
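The integral in the proof has a closed form: substituting y = x^{1/p} gives ∫_0^∞ exp(−x^{1/p}) dx = pΓ(p) = Γ(1 + p), which is at most 1 for p ∈ (0,1). The following sketch checks this numerically with a midpoint rule. The function names, the truncation point and the step count are our own choices.

```python
import math

def tail_integral_closed_form(p):
    """Integral of exp(-x**(1/p)) over [0, infinity), which equals Gamma(1 + p)."""
    return math.gamma(1 + p)

def tail_integral_numeric(p, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the same integral, truncated to [0, upper]."""
    h = upper / steps
    return h * sum(math.exp(-((j + 0.5) * h) ** (1 / p)) for j in range(steps))

checks = {p: (tail_integral_closed_form(p), tail_integral_numeric(p)) for p in (0.5, 0.9)}
```

For p = 1/2 both values agree with the Gaussian integral √π/2.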


We suspect there exists a policy π and universal constant B > 0 such that
for all ν ∈ ℰ^k,

\[
P\left(\bar R_n(\pi,\nu) \ge B\sqrt{kn}\,\log\left(\frac{1}{\delta}\right)\right) \le \delta .
\]


17.2 Adversarial Bandits 


We now explain how to translate the ideas in the previous section to the adversarial
model. Let π = (π_t)_{t=1}^n be a fixed policy, and recall that for x ∈ [0,1]^{n×k}, the
random regret is

\[
\hat R_n = \max_{i \in [k]} \sum_{t=1}^n (x_{ti} - x_{tA_t}) .
\]

Let F_x be the cumulative distribution function of the law of R̂_n when policy π
interacts with the adversarial bandit x ∈ [0,1]^{n×k}.


THEOREM 17.4. Let c, C > 0 be sufficiently small/large universal constants and
k ≥ 2, n ≥ 1 and δ ∈ (0,1) be such that n ≥ Ck log(1/(2δ)). Then there exists a
reward sequence x ∈ [0,1]^{n×k} such that


The proof is a bit messy, but is not completely without interest. For the sake of
brevity, we explain only the high-level ideas and refer you elsewhere for the gory
details. There are two difficulties in translating the arguments in the previous
section to the adversarial model. First, in the adversarial model, we need the
rewards to be bounded in [0,1]. The second difficulty is that we now analyse the
adversarial regret rather than the random pseudo-regret. Given a measure Q, let
X ∈ [0,1]^{n×k} and (A_t)_{t=1}^n be a collection of random variables on a probability
space (Ω, F, P_Q) such that


(a) P_Q(X ∈ B) = Q(B) for all B ∈ 𝔅([0,1]^{n×k}); and
(b) P_Q(A_t | A_1, X_1, ..., A_{t−1}, X_{t−1}) = π_t(A_t | A_1, X_1, ..., A_{t−1}, X_{t−1}) almost
surely, where X_s = X_{sA_s}.


Then the regret is a random variable R̂_n : Ω → R defined by

\[
\hat R_n = \max_{i \in [k]} \sum_{t=1}^n (X_{ti} - X_{tA_t}) .
\]

Suppose we sample X ∈ [0,1]^{n×k} from distribution Q on ([0,1]^{n×k}, 𝔅([0,1]^{n×k})).


CLAIM 17.5. Suppose that X ∼ Q, where Q is a measure on [0,1]^{n×k} with the
Borel σ-algebra and that E_Q[1 − F_X(u)] ≥ δ. Then there exists an x ∈ [0,1]^{n×k}
such that 1 − F_x(u) ≥ δ.


The next step is to choose Q and argue that E_Q[1 − F_X(u)] ≥ δ for sufficiently
large u. To do this, we need a truncated normal distribution. Define the clipping
function

\[
\operatorname{clip}_{[0,1]}(x) =
\begin{cases}
1 & \text{if } x > 1 \\
0 & \text{if } x < 0 \\
x & \text{otherwise.}
\end{cases}
\]

Let σ and Δ be positive constants to be chosen later and (η_t)_{t=1}^n a sequence of
independent random variables with η_t ∼ N(1/2, σ²). For each i ∈ [k], let Q_i be
the distribution of X ∈ [0,1]^{n×k}, where

\[
X_{tj} =
\begin{cases}
\operatorname{clip}_{[0,1]}(\eta_t + \Delta) & \text{if } j = 1 \\
\operatorname{clip}_{[0,1]}(\eta_t + 2\Delta) & \text{if } j = i \text{ and } i \ne 1 \\
\operatorname{clip}_{[0,1]}(\eta_t) & \text{otherwise.}
\end{cases}
\]

Notice that under any Q_i for fixed t, the random variables X_{t1}, ..., X_{tk} are not
independent, but for fixed j, the random variables X_{1j}, ..., X_{nj} are independent
and identically distributed. Let P_{Q_i} be the law of A_1, X_1, ..., A_n, X_n when policy
π interacts with the adversarial bandit sampled from X ∼ Q_i.
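The construction of Q_i can be sketched in a few lines. The code below draws a reward table using a single shared Gaussian per round, as in the definition above. The function names, the 1-indexed arm convention in the signature and the example parameters are our own illustrative choices.

```python
import random

def clip01(x):
    """Clip x to the interval [0, 1]."""
    return min(1.0, max(0.0, x))

def sample_rewards(n, k, i, sigma, gap, rng):
    """Draw an n-by-k reward table X from Q_i (arms are 1-indexed as in the text).

    One Gaussian eta_t ~ N(1/2, sigma^2) is shared by all arms in round t.
    Arm 1 is shifted by gap, arm i != 1 by 2 * gap, and every other arm is unshifted.
    """
    table = []
    for _ in range(n):
        eta = rng.gauss(0.5, sigma)
        row = []
        for j in range(1, k + 1):
            if j == 1:
                shift = gap
            elif j == i:
                shift = 2 * gap
            else:
                shift = 0.0
            row.append(clip01(eta + shift))
        table.append(row)
    return table

rng = random.Random(0)
X = sample_rewards(n=1000, k=4, i=3, sigma=0.1, gap=0.05, rng=rng)
# Arm i = 3 sits at column index 2; it has the largest reward in every round.
best_every_round = all(row[2] == max(row) for row in X)
```

Because clipping is monotone and all arms share the same η_t, arm i is (weakly) best in every round, not only on average.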


CLAIM 17.6. If σ > 0 and Δ = σ√(((k − 1)/(2n)) log(1/(8δ))), then there exists an arm i such
that

\[
P_{Q_i}\left(T_i(n) < n/2\right) \ge 2\delta .
\]


The proof of this claim follows along the same lines as the theorems in the
previous section. All that changes is the calculation of the relative entropy. The
last step is to relate T_i(n) to the random regret. In the stochastic model, this was
straightforward, but for adversarial bandits there is an additional step. Notice
that under Q_i, it holds that X_{ti} − X_{tA_t} ≥ 0 and that if X_{ti}, X_{tA_t} ∈ (0,1) and
A_t ≠ i, then X_{ti} − X_{tA_t} ≥ Δ. From this we conclude that

\[
\hat R_n \ge \Delta\left(n - T_i(n) - \sum_{t=1}^n \mathbb{I}\left\{\text{exists } j \in [k] : X_{tj} \in \{0,1\}\right\}\right) . \tag{17.8}
\]


The following claim upper-bounds the number of rounds in which clipping occurs 
with high probability. 


CLAIM 17.7. If σ = 1/10 and Δ < 1/8 and n ≥ 32 log(1/δ), then

\[
P_{Q_i}\left(\sum_{t=1}^n \mathbb{I}\left\{\text{exists } j \in [k] : X_{tj} \in \{0,1\}\right\} \ge \frac{n}{4}\right) \le \delta .
\]
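A rough numerical check of why clipping is rare: in any round, some arm is clipped only if η_t ≤ 0 or η_t + 2Δ ≥ 1, and rounds are independent, so Hoeffding's inequality bounds the chance that clipping occurs in a constant fraction of rounds. The sketch below takes that fraction to be 1/4, which is an assumption on our part, and all function names are ours.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def clip_probability_bound(sigma, gap):
    """Upper bound on P(some arm is clipped in a round) for eta ~ N(1/2, sigma^2)."""
    low = normal_cdf((0.0 - 0.5) / sigma)                  # P(eta <= 0)
    high = 1 - normal_cdf((1.0 - 2 * gap - 0.5) / sigma)   # P(eta + 2 * gap >= 1)
    return low + high

def hoeffding_tail(n, q, fraction=0.25):
    """Hoeffding bound on P(at least fraction * n of n independent rounds are clipped)."""
    if q >= fraction:
        return 1.0  # the bound is vacuous when the per-round probability is too large
    return math.exp(-2 * n * (fraction - q) ** 2)

delta = 0.01
n = math.ceil(32 * math.log(1 / delta))
q = clip_probability_bound(sigma=0.1, gap=0.125)
tail = hoeffding_tail(n, q)
```

With σ = 1/10 and Δ at its largest allowed value 1/8, the per-round clipping probability is below one percent, and the Hoeffding tail is far smaller than δ once n ≥ 32 log(1/δ).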


Combining Claim 17.6 and Claim 17.7 with Eq. (17.8) shows there exists an
arm i such that

\[
P_{Q_i}\left(\hat R_n \ge \frac{n\Delta}{4}\right) \ge \delta,
\]

which by the definition of Δ and Claim 17.5 implies Theorem 17.4.


Notes 


1 The adversarial bandits used in Section 17.2 had the interesting property that
the same arm has the best reward in every round (not just the best mean).
This cannot be exploited by an algorithm, however, because it only gets a
single observation in each round.


2 In Theorem 17.4, we did not make any assumptions on the algorithm. If we
had assumed the algorithm enjoyed an expected regret bound of R_n ≤ B√(kn),
then we could conclude that for each sufficientdder 
of his she-camel, from which the believers of every religious Community [umma] 
may drink, but not the unbelievers. 

According to yet another tradition [hadith], the Prophet (Allah bless 
him and give him peace) is reported as having said: 

My Basin is as wide as the distance between Aden [‘Adan] and Oman [‘Uman].
It is flanked by pavilions made of pearls that have been hollowed out, and its jugs
are as numerous as the stars in the sky. Its porcelain consists of the most fragrant


musk, and its water is whiter than milk, cooler than snow, and sweeter than 
honey. Anyone who drinks a draught of it will never feel thirsty again. 


When the Day of Resurrection comes, some men will be chased away from me, 
just as the stray she-camel is chased away from the herd of camels, so I shall say: 
“Why don’t you come here! Why don’t you come here?” Then I shall hear 
people telling me: “You have no idea what mischief they got up to after your 
lifetime!” So I shall ask: “What mischief did they get up to?” Then people will
say: “They introduced heretical changes and alterations.” So I shall say: 
“A curse be upon them, in that case, and let them be damned!” 

This belief in the Basin is rejected by the Mu‘tazila,268 so they will not
be allowed to drink from it. They will be sent away thirsty and made to
enter the Fire of Hell, unless they repent their false doctrine, their
denial of the Truth [Haqq], and their rejection of the relevant Quranic
verses [āyāt] and traditional reports [akhbar wa athar].

In a report transmitted on the authority of Anas ibn Malik (may 
Allah be well pleased with him) the following saying is attributed to the 
Prophet himself (Allah bless him and give him peace): 

Anyone who denies the reality of the intercession [shafa‘a] will have no share in 
it, and anyone who denies the reality of the Basin [hawd] will have no share in it. 


267 One of the Prophets of Arabia (peace be upon them). 
268 See note '74 on p. 178 above. 


On the orthodox Islamic doctrine that, on the Day 
of Resurrection, Allah will cause 
His Messenger and Prophet 
(Allah bless him and give him peace) 
to sit upon the Heavenly Throne. 


Those who remain faithful to the orthodox tradition of Islam [ahl
as-Sunna] are firmly convinced that, on the Day of Resurrection,
Allah will cause His Messenger [Rasūl] and His Chosen Prophet [Nabī
Mukhtar] to sit above all the rest of His Prophets and His Messengers,
together with Him upon the Throne. 

This belief is based on a report transmitted on the authority of 
‘Abdu’llah ibn ‘Umar (may Allah be well pleased with him and with
his father), according to whom the Prophet (Allah bless him and give
him peace) once gave an explanation of His words (Almighty and
Glorious is He):

It may be [O Muhammad] that your Lord will raise you up to a praiseworthy
station [maqāman maḥmūdā]. (17:79)


He said (Allah bless him and give him peace): 


[This means that] He will cause him to sit together with Himself upon the
Throne.269


From a report transmitted on the authority of Hisham ibn ‘Urwa, we
learn that ‘A’isha (may Allah the Exalted be well pleased with her)
once asked Allah’s Messenger (Allah bless him and give him peace)
what was meant by the ‘praiseworthy station,’ and he said (Allah bless
him and give him peace):

My Lord has promised to let me sit upon the Heavenly Throne.270


A similar report has come down to us from ‘Umar ibn al-Khattab
(may Allah be well pleased with him).
269 yujlisuhu ma‘ahu ‘ala’s-sarīr.
270 wa‘adanī Rabbī’l-qu‘ūda ‘ala’l-‘arsh.

‘Abdu’llah ibn Salam (may Allah be well pleased with him) is
reported as having said: “When the Day of Resurrection has arrived,
your Prophet will be brought forth, and he will then be seated in the
presence of Allah, upon His Footstool [Kursi].” Some people said to
him: “O Abū Mas‘ūd, if he is going to be on the Footstool of the Lord of
Truth, does that mean he will not be together with Him?” So he replied: 
“Woe unto you! To me, this is the most delightful story in the world.” 

In giving his account of the same subject, al-Hajjaj said: “When the 
Day of Resurrection has arrived, the All-Compelling One [Jabbar] will 
set Himself down upon His Throne, with His feet upon His Footstool. 
Then your Prophet (Allah bless him and give him peace) will be 
brought forth, and he will be seated in His presence, upon His Foot- 
stool.” They said to al-Humaidi, “If he is going to be on the Footstool, 
will he really be together with Him?” So he replied: “Yes, woe unto you,
he will indeed be together with Him!” 




On the orthodox Islamic doctrine that, 
on the Day of Resurrection, Allah (Exalted is He) 


will call His believing servant to account. 


Those who remain faithful to the orthodox tradition of Islam [ahl
as-Sunna] are also firmly convinced that, on the Day of Resurrection, 
Allah (Exalted is He) will call His believing servant to account, that He 
will draw him close to Himself, and that He will cast His protective 
shadow over him, in order to hide him from the sight of human beings. 
This belief is based on the traditional report handed down on the 
authority of “Abdu'llah ibn ‘Umar (may Allah be well pleased with him 
and with his father), who stated that he once heard Allah's Messenger 
(Allah bless him and give him peace) say: 
The believer [mu’min] will be brought forth on the Day of Resurrection, and 
Allah (Exalted is He) will draw him close to Himself. Then He will cast His 
protective shadow over him, in order to hide him from the sight of human 


beings. Then He will say, repeating each question twice: “My servant, do you 
confess to such and such a sin? Do you confess to such and such a sin?”


The servant will continue to reply: “Yes, my Lord,” until He has made him 
acknowledge every one of his sins, at which point he will feel sure that he must be 
doomed. But then the Lord of Truth (Almighty and Glorious is He) will say to 
him: “My servant, although you committed all these sins, I overlooked them while
you were still in the world below, and today I shall grant you pardon for them.”


What is meant by the ‘reckoning’ or ‘calling to account’ [muhāsaba]
is the process whereby Allah makes His servant aware of exactly how 
much reward his actions have earned him, and exactly how much 
punishment is due, by reading out the list of his bad deeds or his good 
deeds, and noting what he has to his credit and what is recorded in his 
debit column. 


The Mu‘attila271 may try to deny the reality of the reckoning, but
Allah (Exalted is He) has given them the lie, for He has told us: 


Truly, unto Us is their return; then upon Us shall rest the responsibility for their 
reckoning. (88:25,26) 


271 See note '”7 on p. 179 above. 
On the orthodox Islamic doctrine that Allah
(Exalted is He) has a Balance [Mizan], in which, 
on the Day of Resurrection, He will weigh the 
good deeds and the bad deeds of the believers. 


Those who remain faithful to the orthodox tradition of Islam [ahl
as-Sunna] are firmly convinced that Allah (Exalted is He) has a
Balance [Mizan], equipped with two scales and a tongue, in which, on 
the Day of Resurrection, He will weigh the good deeds and the bad 
deeds of the believers. 

This belief is rejected by the Mu‘tazila,272 as well as by the Murji’a273
and the Khawarij,274 since they maintain that the Balance is meant to
be understood allegorically, as a figurative expression for Justice [‘adl],
and that it does not signify the literal weighing of deeds. The refutation
of their doctrine can be found in the Book of Allah and the Sunna of 
His Messenger. Allah (Exalted is He) has said: 

And We shall set up the just scales [al-mawāzīna’l-qisṭ] for the Day of Resurrec-
tion, so that not one soul shall be wronged in anything at all. Even if it be the 


weight of one grain of mustard seed, We shall produce it, and sufficient are We 
for reckoners. (21:47) 

Then he whose deeds weigh heavy in the scales shall inherit a pleasing life, but 
he whose deeds weigh light in the scales, his mother shall be the Pit. Ah, what 
will convey to you what she is? A blazing fire! (101:6-11)275


Justice is not described merely in terms of lightness and heaviness! The
Balance must surely be held in the hand of the All-Merciful 
[ar-Rahman] (Glorious is His Majesty), because He is the One who 


272 See note '”* on p. 178 above.

273 There is a lack of unanimity among the scholars with regard to how the Murji’a [“The
Postponers”] came to be so called. The most probable explanation is that they acquired the name
because of their great emphasis on the doctrine of irjā’ [postponement], according to which the
judgment of i believers must be deferred until the Resurrection. 


274 See note * 47 on p. 217 above. 


275 fa-amma man thaqulat mawdzinuhu—fa-huwa fi ‘Ishatin rddiya—wa amma man khaffat 
mawéizinuchu—fa-ummuhu Hawitya—wa ma adréka ma hiya—narun hamiya. 
is in charge of their reckoning. This is confirmed by the traditional
account of an-Nawwas ibn Sam‘an al-Kullabi (may Allah be well 
pleased with him), who reported that he once heard Allah’s Messenger 
(Allah bless him and give him peace) say: 
The Balance [Mizan] will be held in the hand of the All-Merciful (Almighty and 
Glorious is He). Some people it will raise up high, and others it will bring down 
low, on the Day of Resurrection. 

There are some, however, who maintain that it will be held in the 
hand of Gabriel (peace be upon him), on the strength of a report 
transmitted by Hudhaifa ibn al-Yamani (may Allah be well pleased 
with him), who said: “Gabriel (peace be upon him) will be the holder 
of the Balance, so His Lord will say to him: ‘Weigh, O Gabriel, and 
see how their weights compare!’ Then they will be checked, one in 
comparison to another.” 

According to the report of “Abdu'llah ibn “Umar (may Allah be well 
pleased with him and with his father), Allah’s Messenger (Allah bless 
him and give him peace) once said: 

The Balance will be set up on the Day of Resurrection. A man will then be 
brought forth, to be placed in one scale of the Balance, while all of his deeds that 
have been counted are also placed in one scale or the other. When this causes 


the Balance to tilt [in the wrong direction], Allah will give the order for the man
to be led off to the Fire.


But at that very moment, just when he has turned to go, a crier will cry out from
the presence of the All-Merciful: “Do not be in such a hurry! Do not be in such
a hurry, for there is still something left to be counted in his favor!” Something
will then be produced, in which the words ‘La ilaha illa’llah [There is no god but
Allah]’ can be seen. This will be put in the man's place in the scale of his good
deeds, causing the Balance to tilt in his favor, so the order will then be given 
for him to be led off to the Garden of Paradise. 


In yet another tradition [hadith], the following words are attributed to 
the Prophet (Allah bless him and give him peace): 


On the Day of Resurrection, a man will be brought up to the Balance. Then 
ninety-nine scrolls [sijill] will be produced, each of them stretching as far as the 
eye can see. All of these scrolls will contain the record of his evil deeds and sinful
errors. His evil deeds will thus be seen to outweigh his good deeds, so the order 
will be given for him to be led off to the Fire of Hell. 


But at that very moment, just when he has turned to go, a crier will cry out from
the presence of the All-Merciful: “Do not be in such a hurry! Do not be in such
a hurry, for there is still something left to be counted in his favor!” Something
no bigger than the tip of a man’s thumb will then be produced. (As he was saying
this, the Prophet squeezed half of his own thumb between his fingers.)276


It will contain the declaration of faith [shahada], testifying that there is no god 
but Allah, and that Muhammad is the Messenger of Allah [an lā ilāha illa’llāh wa
anna Muhammadan Rasūlu’llāh]. This will be placed in the scale of his good
deeds, causing his good deeds to weigh heavier than his evil deeds, so the order
will then be given for him to be led off to the Garden of Paradise.277

It has been said that the weights [sanj] to be used on that Day [of 
Resurrection] will be those that are now used to measure tiny grains and 
mustard seeds. 

Good deeds will take the form of some beautiful object, which will be 
cast into the scale of radiant light. The Balance will then record the 
weight of it as heavy, because of the mercy [rahma] of Allah. Evil deeds 
will take the form of some foul object, which will be cast into the scale 
of darkness. The Balance will then register the weight of it as light, 
because of the justice [“adl] of Allah (Exalted is He). 

The Balance will indicate the presence of a heavy weight by the 
upward motion of the scale, and of a light weight by the downward 
motion of the scale, contrary to the balances of this world. What causes 
it to register a heavy weight will be faith [tman] and the utterance of the 
twofold declaration of belief [qawl ash-shahadatain], and what causes it 
to register a light weight will be the ascription of partners[shirk] to Allah 
(Almighty and Glorious is He). If the scale moves upward, the person 
concerned will be admitted to the Garden of Paradise, because it is up 
on high. But if it registers a light weight, the person concerned will be 
made to enter the bottomless pit of the Fire of Hell, because it is in the 
region of the lowest of the low. As Allah (Almighty and Glorious is He) 
has said: 

Then he whose deeds weigh heavy in the scales shall inherit a pleasing life, but 
he whose deeds weigh light in the scales, his mother shall be the Pit. (101:6-9) 


In other words, the former shall dwell in a Garden of Paradise on high, 


276 This sentence represents a parenthetic observation by the narrator of the tradition [hadith],
interjected to give graphic effect to the words of the Prophet (Allah bless him and give him peace).
277 Author's note: In a slightly different version, the wording is: “A scrap of paper [qirtās] no bigger
than this—the Prophet squeezed his thumb to demonstrate—will be produced on his behalf. It 
will contain the declaration of faith [shahddal, testifying that there is no god but Allah, and that 
Muhammad is the Messenger of Allth [an lailatha dla’ llah—wa anna Muh Rasiilu’ llah]....” 

The rest of the tradition (hadith) is narrated in the same words as the version quoted in the text. 
while the latter must have his base, his abode and his destination in a
blazing fire, for such is the Pit.

When it comes to the balancing [muwazana] of deeds, human beings 
fall into three different categories: 

1. There are some whose good deeds outweigh their evil deeds, and 
who will therefore be commanded to enter the Garden of Paradise. 

2. There are some whose evil deeds outweigh their good deeds, and 
who will therefore be commanded to enter the Fire of Hell. 

3. There are some cases where neither set of deeds outweighs the
other. Such people are known as the Dwellers on the Heights [Ashab 
al-A‘raf], until such time as Allah bestows His mercy upon them, 
whenever He so wills, and finally allows them to enter the Garden of 
Paradise. It is they who are referred to in His words (Almighty and 
Glorious is He): 

And on the Heights [“ala’l-A ‘rafi] there are certain men who know them all by 
their marks. And they call out to the inhabitants of the Garden of Paradise: 
“Peace be upon you!” They have not entered it themselves, although they may 
hope [to do so one day]. (7:46) 

There is also the case of the man whose deeds will be weighed as 
written records, in the form of ninety-nine scrolls, as we have already 
mentioned. All that we know about this has reached us by way of oral 
transmission and hearsay. 

As for those who enjoy a special intimacy with the Lord [al-muqarrabūn],278
they will be allowed to enter the Garden of Paradise without any 
reckoning. 

To quote the exact words of the well-known tradition [hadith]: 
Seventy thousand will enter the Garden of Paradise without any reckoning, and 
each one of them will bring seventy thousand along with him. 

As for the unbelievers [kafirun], they will enter the Fire of Hell 
without any reckoning. 

Among the believers [mu’minun], there are some who will have to 
undergo a simple reckoning, and then the order will be given for them 
to be admitted to the Garden of Paradise, as mentioned above. There 
are also some who will be subjected to a rigorous examination, and then 
each case will be left for Allah to decide. He may give the order for a 


278 Literally, ‘those who are drawn close.’ 
particular individual to be admitted to the Garden of Paradise, or to the 
Fire of Hell, as He wills. He has said (Almighty and Glorious is He): 


Then as for him who is given his record in his right hand, he shall surely receive 
an easy reckoning, and he will return to his family in joy. But as for him who 
is given his record behind his back, he will surely call for destruction, and he 
shall be roasted at a scorching blaze. (84:7-12) 

And every man's bird of omen We have fastened to his own neck, and We shall 
bring forth for him, on the Day of Resurrection, a book which he will find wide 
open. [And it will be said unto him:] “Read your book! Your own soul suffices 
you this day as a reckoner against you.” (17:13,14) 


We may also cite the tradition [hadith] of ‘Ali (may Allah be well 
pleased with him), according to whom the Prophet (Allah bless him 
and give him peace) once said: 

Allah will surely call every creature to account, except for those who ascribe 
partners [ashraka] to Allah. If a person is guilty of this, he will not be called to 
account; he will be ordered to go straight into the Fire of Hell. 


On the orthodox Islamic doctrine 
that the Garden of Paradise [al-Janna] and 
the Fire of Hell [an-Nar] are products of creation. 


Those who remain faithful to the orthodox tradition of Islam 
[ahl as-Sunna] are firmly convinced that both the Garden of 
Paradise [al-Janna] and the Fire of Hell [an-Nar] are products of 
creation. They are two abodes [daran] which Allah (Exalted is He) has 
made ready, one of them for the blissful comfort and reward to which 
the people of worshipful obedience and faith [ahl at-ta‘a wa’l-iman] are 
entitled, and the other for the chastisement and exemplary punishment 
deserved by those who are guilty of sinful disobedience and transgression 
[ahl al-ma‘asi wa’t-tughyan]. 

Since the moment when Allah (Exalted is He) created them both, 
they have been everlasting, and they shall never cease to exist. The 
Garden of Paradise is the place where Adam and Eve (peace be upon 
them both) once dwelt, as well as the accursed Iblis, but then they were 
evicted from it, as the well-known story tells. 

The Mu‘tazila279 refuse to accept this doctrine. As for the Garden of 
Paradise, therefore, they shall not enter it, and as for the Fire of Hell— 
by my life!—they shall dwell in it forever, condemned for all eternity 
because of their denial. They surely deserve such a fate, since they 
themselves consider it to be the proper punishment for a believer 
[mu’min] who is guilty of a single major sin, regardless of his being one 
who affirms the Divine Unity [muwahhid] and who is normally obedient 
to Allah (Almighty and Glorious is He). 

The refutation of their false doctrine can be found in the Book of 
Allah and in the Sunna of Allah’s Messenger (Allah bless him and give 
him peace). Allah (Almighty and Glorious is He) has said: 

And a Garden as wide as are the heavens and the earth, which has been made 
ready for those who are devoted to their duty [muttaqin]. (3:133) 


279 See note 14 on p. 178 above. 
Beware of the Fire which has been made ready for the unbelievers [kafirin]. 
(3:131) 

If something has been made ready, it must have been brought into 
existence. Every rational person knows this, so he must acknowledge 
the fact that the Garden of Paradise and the Fire of Hell are both 
products of creation. 

According to the tradition [hadith] of Anas ibn Malik (may Allah be 
well pleased with him), Allah’s Messenger (Allah bless him and give 
him peace) once said: 

I was admitted to the Garden of Paradise, and—lo and behold!—there I was 
beside a flowing stream, flanked on both sides by pavilions made of pearls. I 
dipped my hand in [what looked like] water flowing by, and—lo and behold!— 
it was musk, of the most exquisitely fragrant kind. I said: “O Gabriel, what is 
this?” He replied: “This is the River of Abundance [al-Kawthar], which Allah 
(Exalted is He) has bestowed upon you.” 

According to the tradition [hadith] of Abu Huraira (may Allah be well 
pleased with him), somebody once said to the Prophet (Allah bless him 
and give him peace): “O Messenger of Allah, tell us about the Garden 
of Paradise. Of what is it constructed?” To this he replied (Allah bless 
him and give him peace): 

A brick of gold and a brick of silver. Its pavement is musk of the most exquisitely 
fragrant kind. Its pebbles are sapphires [yaqut] and pearls [lu’lu’], and its soil is 
turmeric [wars] and saffron [za‘faran]. Those who enter it will stay there forever. 
They will never die. They will lead a comfortable and carefree life, and they will 
never feel despair. Their clothes will never turn to rags, and their youth will 
never become old age. 

Here we have evidence to prove that both [the Garden of Paradise 
and the Fire of Hell] are products of creation, and that the bliss 
experienced in the Garden of Paradise is a permanent state, which shall 
never pass away. As Allah (Exalted is He) has said: 


Its food is everlasting, and [so is] its shade. (13:35) 
Neither out of reach nor yet forbidden. (56:33) 
Included among its blessings are the fair maidens with wide, lovely 
eyes [al-hur al-‘in],280 whom Allah has created to abide forever in the 
Garden of Paradise. They shall not pass away, nor shall they ever die. 
As Allah (Almighty and Glorious is He) has said: 


Therein are maidens of modest gaze, whom neither man nor jinn have touched 
before them. (55:56) 


280 See note ® on p. 92 above. 
We may also quote His words (Hallowed and Exalted is He): 

Fair ones, close-guarded in pavilions [hurun maqsuratun fi’l-khiyam]. (55:72) 

Umm Salama, the wife of the Prophet (Allah bless him and give him 
peace), related the following conversation: 

“I said: ‘O Messenger of Allah, tell me about the words of Allah 
(Almighty and Glorious is He): “Like unto hidden pearls 
[ka-amthali’l-lu’lu’i’l-maknun].”’281 To this he replied: ‘Their pure beauty [safa’] is like 
the pure beauty of pearls inside their oyster shells.’ He went on talking, 
until he eventually said: ‘They say: “We are the immortal females 
[khalidat], so we shall never die. We are the blissfully happy females 
[na‘imat], so we shall never feel despair. We are the permanently 
abiding females [muqimat], so we shall never move away. We are the 
well-contented females [radiyat], so we shall never be dissatisfied.” They 
dwell in an abode of Truth [Haqq], so they tell nothing but the truth.’” 

We have it on the authority of Mu‘adh ibn Jabal (may Allah be well 
pleased with him) that the Prophet (Allah bless him and give him 
peace) once said: 

If a wife makes her husband suffer in this world, what is bound to happen is that 
his wife among the fair maidens of Paradise, with those wide, lovely eyes [al-hur 
al-‘in], will say to her: “Do not make him suffer (may Allah curse you!), for 
he is only a temporary guest in your company, always on the verge of leaving you 
for us.” 

Once it has been established beyond doubt that both the Garden of 
Paradise and the Fire of Hell, and all that they contain, shall never cease 
to exist, it is clear that Allah (Exalted is He) will not expel anyone from 
the Garden of Paradise, that death can have no power over those who 
dwell therein, and that they shall never cease to experience its joyful 
bliss. Their bliss will actually increase from day to day, through the 
endlessness of all eternity. To make their bliss complete, Allah will give 
the command for death itself to be slaughtered, on a wall between the 
Garden of Paradise and the Fire of Hell. The herald will then cry out: 
“O you who dwell in the Garden of Paradise, now there is only eternal 
life [khulud], for death is no more! O you who dwell in the Fire of Hell, 
now there is only eternal life, for death is no more!” 

This is based on the authentic tradition [khabar sahih], as it has been 
handed down to us from the Prophet (Allah bless him and give him 
peace). 


281 Qur'an 56:23. 


On the orthodox Islamic doctrine concerning the 
Basin [Hawd] of the Prophet (Allah bless him and 
give him peace), from which the believers will 
quench their thirst at the Resurrection. 


Those who remain faithful to the orthodox tradition of Islam [ahl 
as-Sunna] are firmly convinced that, when the Resurrection comes, 
our Prophet (Allah bless him and give him peace) will have at his 
disposal a Basin [Hawd] from which he will supply water to quench the 
thirst of the believers, but not of the unbelievers, and that this drinking 
from the Basin will occur after their crossing of the Narrow Bridge 
[Sirat], and prior to their entry into the Garden of Paradise. 

Anyone who drinks a draught from the Basin will never feel thirsty 
again. The width of it is equal to the distance traveled in a month. Its 
water is whiter than milk and sweeter than honey. Around it there are 
as many jugs as there are stars in the sky. Inside it there are two spouts, 
through which the water is channeled from the River of Abundance 
[Kawthar], the source of which is in the Garden of Paradise, while its 
offshoot is at the Place of Standing.282 

According to the tradition [hadith] of Thawban (may Allah be well 
pleased with him), it was mentioned by the Prophet (Allah bless him 
and give him peace) when he said: 

I shall be there beside my Basin on the Day of Resurrection. 

The Prophet (Allah bless him and give him peace) was asked about 
the capacity of the Basin, so he said: 


[It is as wide as] the distance between this spot, where I am standing now, and 
‘Uman [in the southeast corner of Arabia]. The drink it contains, which is 
whiter than milk and sweeter than honey, is channeled into it from the Garden 
of Paradise through two spouts, one of them made of silver and the other of gold. 
Anyone who drinks a draught from it will never feel thirsty again. 


282 The site at ‘Arafa where Pilgrims perform the rite of ‘standing’ [wuquf]. 
or prose, rhetoric or eloquence, but has a style that surpasses the 
eloquence [fasaha] of every eloquent speaker [fasih], and the rhetorical 
skill [balagha] of every fluent orator [baligh]. The Arabs were incapable 
of producing anything to compare with it. They could not even match 
one chapter [sura] from it. Allah (Exalted is He) did in fact challenge 
them, saying: 

Then produce ten suras the like thereof, invented [by yourselves]. (11:13) 

When they failed to meet this challenge, He said (Exalted is He): 
Then produce a single sura the like thereof. (2:23) 


But they were incapable even of this, however great their superiority 
over all their contemporaries in the arts of rhetoric and eloquent speech. 
They had to admit defeat, since the superior merit of the Qur’an was 
overwhelmingly apparent to them. 

This is why the Qur’an has come to be recognized as a miracle 
[mu‘jiza] peculiar to the Prophet Muhammad (Allah bless him and give 
him peace), comparable to the miracle of the staff [‘asa] in the case of 
Moses (peace be upon him). Moses was sent during the era in which the 
sorcerers [sahara] and skilled practitioners of the art of magic were in the 
ascendant, so the staff of Moses (peace be upon him) swallowed up all 
the tricks they used to employ in order to bewitch people's eyes and to 
cast a spell upon them. 

So they were vanquished there, and they turned about, reduced to humiliation. 
And the sorcerers were cast down, bowing low in prostration. (7:119,120) 

It is also comparable to the miracles performed by Jesus (peace be 
upon him), when he brought the dead back to life, and in his healing 
of the leper and the man who had been blind from birth [akmah]. Jesus 
(peace be upon him) was sent during a period of history when the most 
influential people were skilled physicians. They were well aware that 
certain sicknesses and diseases could not be cured by them, for all their 
superb skill in the practice of medicine, so they accepted his guidance 
and believed in him, on account of his superiority over them in their 
own craft, and his remarkable ability to perform miracles in their own 
field of professional expertise. 

Thus the eloquence of the Qur’an and its inimitable character [i‘jaz] 
can be regarded as a miracle [mu‘jiza] peculiar to the Prophet (Allah 
bless him and give him peace), comparable to the miracles of the staff 
and the restoration of the dead to life, in the respective cases of Moses 
and Jesus (peace be upon them both). 

Other miracles performed by the Prophet (Allah bless him and give 
him peace) include: 

1. The gushing forth of water from between his fingers.284 

2. Feeding a multitude of people with a tiny supply of food.285 

3. The speaking of the poisoned arm, and its uttering the words: “Do 
not eat from me, for I have been poisoned!” 

4. The splitting of the moon [inshiqaq al-qamar].286 

5. The plaintive cry of the palm tree stump [hanin al-jidh‘].287 

6. The speaking of the camel. 

7. The movement of the tree toward him.288 


Of course, these are but a few of the miracles ascribed to him (Allah 
bless him and give him peace), since their total number is said to be as 
high as one thousand. 

There are two good reasons to explain why the Prophet (Allah bless 
him and give him peace) did not reproduce such earlier miracles as the 
staff of Moses and his white hand, the revival of the dead and the 
healing of the leper and the man who had been blind from birth 
[performed by Jesus], the she-camel of Salih,289 or the particular miracles 
assigned to the rest of the Prophets [anbiya’]. One of these two reasons 
is that it was necessary to ensure that his Community [umma] would not 


284 The Prophet (Allah bless him and give him peace) is said to have produced this miraculous 
supply of water to quench the thirst of his Companions (may Allah be well pleased with them all) 
during the encounter with the unbelievers of Quraish that resulted in the treaty of al-Hudaibiyya. 
285 According to some traditional reports, the Prophet (Allah bless him and give him peace) once 
fed a thousand people with the meat of one kid and a single measure [sa‘] of barley. 

286 According to some traditional authorities, this miracle is referred to in the verses of the Qur’an 
(54:1,2) which read: “The hour has drawn nigh, and the moon has split in two. Yet if they see 
a miraculous sign [aya] they turn away, and they say: ‘[It is nothing more than] a prolonged trick 
of sorcery [sihrun mustamirr].’” 

At this point in his famous commentary [tafsir] on the Qur’an, al-Baidawi says: “Some say that 
the unbelievers demanded this sign of the Prophet (Allah bless him and give him peace), and the 
moon was split in two; but others say it refers to a sign of the coming Resurrection.” 

287 This stump is said to have wept to such an extent that it almost split in two, because it was so 
disappointed when the Prophet (Allah bless him and give him peace) refrained from leaning 
against it. 

288 Tradition tells that the Prophet (Allah bless him and give him peace) once fell asleep beneath 
a tree, which then moved with the sun in order to keep him in the shade. According to another 
traditional report, two trees miraculously moved to provide him with shade. 
refuse to accept them as true, and so be doomed to perish as the 
Communities [umam] before them had perished. As Allah (Exalted is 
He) has said: 
Nothing has prevented Us from sending the miraculous signs [ayat], except that 
the folk of old cried lies to them. (17:59) 

The second reason is that if he had in fact produced miracles similar 
to those performed by his predecessors, people would have said to him: 
“You have not come up with anything unheard of. You are merely 
copying from Moses and Jesus, so you must be one of their followers. We 
shall not believe in you, until you produce for our benefit something 
that none of the ancients ever produced.” This is why Allah (Glory be 
to Him) never conferred upon any of His Prophets the same miracle He 
had already bestowed upon another, but chose instead to grant each 
Prophet a special miracle of his own, quite different from that of his 
predecessor. 


289 The story of this she-camel is told in the Qur’an (7:77,78): 
So they hamstrung the she-camel, and they flouted the commandment of their Lord, and they 
said: ‘O Salih! Bring your threats to bear upon us, if you are indeed one of those sent [by Allah].’ 
So the earthquake caught them unawares, and morning found them lying prostrate in their 
place of abode. 
He sent Muhammad (Allah             ba‘atha Muhammadan 
bless him and give him peace)       (salla’llahu ‘alaihi wa 
with the Truth,                     sallam) bi’l-haqq: 
as a Prophet pure and               nabiyyan safiyyan bari’an 
innocent of all vices.              mina’l-‘ahati kulliha. 
And he delivered the message        wa ballagha ma 
he was sent to deliver,             ursila bih: 
like a brilliant lamp               sirajan zahiran 
and a radiant light                 wa nuran sati‘an 
and a shining proof.                wa burhanan lami‘a. 
May Allah bestow                    salla’llahu ‘alaihi 
His blessings and grant peace       wa sallama 
to him and to all                   wa ‘ala alihi 
the members of his family.          ajma‘in. 
Well then, surely these             thumma inna hadhihi’l- 
matters are all in Allah's hand.    umura kullaha bi-yadi’llahi 
He causes them to flow              yusarrifuha 
in their proper channels,           fi tara’iqiha 
and He makes them follow            wa yumaddiha 
their right courses.                fi haqa’iqiha. 
No one can bring forward            la muqaddima 
that which He has postponed,        li-ma akhkhara 
and no one can put back             wa la mu’akhkhira 
what He has brought to the fore.    li-ma qaddama. 
No couple can be joined             wa la yajtami‘u’thnani 
together except by His              illa bi-qada’ihi 
judgment and His decree.            wa qadarih. 
For every judgment                  wa li-kulli qada’in 
there is a decree,                  qadarun 
and for every decree                wa li-kulli qadarin 
there is an appointed time,         ajalun 
and “for every appointed time       wa li-kulli ajalin 
there is a written record.”22       kitab. 
“Allah erases what He will,         yamhu’llahu ma yasha’u 
and He fixes (what He will),        wa yuthbitu 
and with Him is the                 wa ‘indahu 
Essence of the Book.”23             ummu’l-kitab. 
It was in accordance with           wa kana min 
the judgment of Allah               qada’i’llahi 
and His decree,                     wa qadarihi 
that So-and-So, the son of          an yakhtuba 
So-and-So, should offer             Fulanu’bnu Fulan 

22 Qur'an 13:38 (these two lines only, excluding the initial word ‘and [wa].’) 
23 Qur'an 13:39. 
The most excellent of these are the ten for whom the Prophet (Allah 
bless him and give him peace) vouched that they would surely enter 
the Garden of Paradise. They are: Abu Bakr, ‘Umar, ‘Uthman, ‘Ali, 
Talha, az-Zubair, ‘Abd ar-Rahman ibn ‘Awf, Sa‘d [ibn Abi Waqqas], 
Sa‘id [ibn Zaid], and Abu ‘Ubaida ibn al-Jarrah.293 

The most excellent of these righteous ten are the rightly guided 
Caliphs [al-Khulafa’ ar-Rashidun], the four who are the best of all. 

The most excellent of these four is Abu Bakr, followed by ‘Umar, 
then by ‘Uthman, and then by ‘Ali (may Allah the Exalted be well 
pleased with them all). 

The Caliphate [khilafa] belonged to these four, following the death of 
the Prophet (Allah bless him and give him peace), for a period of thirty 
years all told. Abu Bakr (may Allah be well pleased with him) was in 
command for just over two years, ‘Umar (may Allah be well pleased 
with him) for ten, ‘Uthman (may Allah be well pleased with him) for 
twelve, and ‘Ali (may Allah be well pleased with him) for six. The 
office of Caliph was then held by Mu‘awiya for a period of nineteen 
years, prior to which he had spent twenty years as governor of the people 
of Syria, in a post to which he had been appointed by ‘Umar. 

The Caliphate of the Four Imams was a matter of free election by the 
Companions, with their unanimous agreement and willing consent. Its 
acquisition was also due to the superior merit of each of them, in his own 
age and time, over and above the rest of the Companions. It was not 
taken by the sword, by compulsion, coercion and forcible seizure from 
someone more worthy of the position. 




293 These ten noble Companions (may Allah be well pleased with them all) are generally referred 
to as al-‘Ashara’l-Mubashshara [“The ten who received glad tidings”]. 


On the Caliphate of Abu Bakr, 
the Champion of Truth [as-Siddiq] 
(may Allah be well pleased with him). 


As for the Caliphate of Abu Bakr, the Champion of Truth 
[as-Siddiq] (may Allah be well pleased with him), it came about 
through the unanimous agreement of the Emigrants [Muhajirun] and 
the Helpers [Ansar].294 The way it happened was as follows: When the 
earthly life of Allah’s Messenger (Allah bless him and give him peace) 
had come to an end, the spokesmen of the Helpers said: “[Let there now 
be] a leader [amir] from amongst ourselves, and a leader from among you 
[Emigrants].” 

Then up stood ‘Umar ibn al-Khattab (may Allah be well pleased with 
him), saying: “O band of Helpers, surely you are well aware that the 
Prophet (Allah bless him and give him peace) commanded Abu Bakr 
to lead [an ya’umma] the people [in prayer]?” They all said: “Yes, of 
course!” So he said: “In that case, which of you considers himself fit to 
stand in front of Abu Bakr?” They all responded to this by saying: “May 
Allah protect us from presuming to stand in front of Abu Bakr!” 

According to a slightly different account of this same sequence of 
events, the words spoken by ‘Umar (may Allah be well pleased with 
him), were: “In that case, which of you considers himself fit to remove 
him from the position to which he was appointed by Allah’s Messenger 
(Allah bless him and give him peace)?” They all responded to this by 
saying: “None of us considers himself fit to do such a thing! We seek 
forgiveness from Allah [for having entertained such a notion].” 

Thus they found themselves in complete accord with the Emigrants, 
and so the full complement of the Companions paid homage to Abu 
Bakr, including ‘Ali and az-Zubair.295 

294 These terms have been explained in note 453 on p. 225 above. 

295 The participation of ‘Ali and az-Zubair is emphasized at this point, because these two would later be 
involved in serious controversy and conflict over the succession to the Caliphate. 
This also explains why it is stated, in the authentic traditional report, 
that when homage was paid to Abu Bakr (may Allah be well pleased 
with him) he stood up three times, turning to face the people as he said: 
“O people, should I release you from your pledge of allegiance [bai‘a] to 
me? Is anyone making it reluctantly?” 

‘Ali (may Allah be well pleased with him) was among the first of the 
people present to stand up and say in reply: “We shall never depose you, 
we shall never ask for your resignation! It was Allah's Messenger (Allah 
bless him and give him peace) who brought you to the fore, so who 
would dare to push you to the rear?” We have in fact been informed, 
on the authority of very reliable sources, that ‘Ali (may Allah be well 
pleased with him) was the most outspoken of all the Companions in 
favor of the leadership [imama] of Abu Bakr (may Allah be well pleased 
with him). 

We know from traditional reports that ‘Abdu’llah ibn al-Kawwa’ 
entered the presence of ‘Ali (may Allah be well pleased with him) after 
the Battle of the Camel,296 and asked him: “Did Allah’s Messenger 
(Allah bless him and give him peace) entrust you with any special 
information concerning this matter [of the Caliphate]?” To this he 
replied by saying: “We examined our situation carefully, and we had to 
acknowledge that the ritual prayer [salat] is the mainstay of Islam. We 
were therefore content to accept, as appropriate to the conduct of our 
affairs in this world, that which Allah and His Messenger had seen fit 
to approve for the sake of our religion [din]. So we entrusted the matter 
[of the Caliphate] to Abu Bakr.” 

The point that ‘Ali (may Allah be well pleased with him) is making 
here is that the Prophet (Allah bless him and give him peace), during 
the days of his final sickness, had delegated the task of leading the 
prescribed ritual prayer [imamat as-salat al-mafruda] to Abu Bakr, the 
Champion of Truth [as-Siddiq] (may Allah be well pleased with him). 
Bilal297 would come to him at the time appointed for each of the five 

296 The Battle of the Camel was fought in A.H. 36, between ‘Ali and his supporters on one side, and 
an army assembled by Talha and az-Zubair on the other. The battle, in which ‘Ali was victorious, 
acquired its name from the camel ridden by ‘A’isha (may Allah be well pleased with her), who was 
in the thick of the fight, on the side of Talha and az-Zubair, until her camel was killed. 

297 Bilal, an Abyssinian slave who had been ransomed by Abu Bakr, was the first muezzin 
[mu’adhdhin] appointed by the Prophet (Allah bless him and give him peace) to summon the 
Muslim community to the five daily prayers. 
daily prayers, and he would utter the Call to Prayer in his presence 
[yu’adhdhinuhu bi’s-salat]. The Prophet (Allah bless him and give him 
peace) would then say: “Go and tell Abu Bakr that he must lead the 
people in prayer.” 

While he was still alive and well, the Prophet (Allah bless him and 
give him peace) would often speak on the subject of Abu Bakr (may 
Allah be well pleased with him), in such terms that it became quite 
obvious to all the Companions that he must be the person best qualified 
to assume the Caliphate when the Prophet (Allah bless him and give 
him peace) was no longer with them. It was made equally obvious, in 
respect of ‘Umar, ‘Uthman and ‘Ali (may Allah be well pleased with 
them), that each of them would be the one best qualified to assume that 
office in his own age and time. 

Among other evidence to this effect, we have the traditional report 
of Ibn Batta, complete with its chain of transmission [isnad] from ‘Ali 
(may Allah be well pleased with him), who said: “Someone asked: 
‘O Messenger of Allah, whom shall we appoint to be our leader, when 
you are no longer with us?’ He replied (Allah bless him and give him 
peace) by saying: 

If you appoint Abu Bakr to be your leader, you will find him trustworthy, 
abstemious in relation to this world, and enthusiastic in relation to the 
hereafter. If you appoint ‘Umar to be your leader, you will find him strong and 
trustworthy, someone who is not afraid, when the interest of Allah is at stake, 
of criticism from any critic. If you put ‘Ali in charge, you will find him a good 
guide, who is himself rightly guided. 

“Because of this, they agreed unanimously that the Caliphate should 
go to Abu Bakr.” 

According to another traditional report, handed down to us on the 
authority of our own Imam, Abu ‘Abdi’llah Ahmad ibn Hanbal (may 
Allah bestow His mercy upon him), the validity of the Caliphate of Abu 
Bakr (may Allah be well pleased with him) has been established beyond 
any doubt, by clear and explicit evidence as well as by implication. 

This also happens to be the doctrine [madhhab] of al-Hasan al-Basri 
and a significant group of experts in the tradition [ashab al-hadith]. One 
piece of evidence to support their view is provided by Abu Huraira (may 
Allah be well pleased with him), according to whose report the Prophet 
(Allah bless him and give him peace) once said: 
When I was carried aloft on my Heavenly Ascension, I asked my Lord 
(Almighty and Glorious is He) to appoint ‘Ali ibn Abi Talib as Caliph after my 
lifetime. But the angels said: “O Muhammad, Allah does whatever He wills. 
The Caliph after you will be Abu Bakr.” 

The Prophet (Allah bless him and give him peace) also said, according 
to the tradition [hadith] of Ibn ‘Umar (may Allah be well pleased with 
him and with his father): 

The one [who will be Caliph] after me is Abu Bakr, but he will not remain here 
very long after I am gone. 

Mujahid (may Allah bestow His mercy upon him) is reported as 
having said: “‘Ali ibn Abi Talib (may Allah be well pleased with him) 
once said to me: ‘The Prophet (Allah bless him and give him peace) 
did not leave the abode of this lower world until he had informed me: 
“Abu Bakr will be in charge after I am gone. Then ‘Umar, then 
‘Uthman after him, then ‘Ali after him.”’” 




On the Caliphate of ‘Umar ibn al-Khattab 
(may Allah be well pleased with him). 


As for the Caliphate of ‘Umar (may Allah be well pleased with 
him), he was designated to succeed to it by Abu Bakr himself (may 
Allah be well pleased with him), so all the Companions willingly agreed 
to pledge their allegiance to him, and to call him by the title “Commander 
of the Believers [Amir al-Mu’minin].” 

According to the statement of ‘Abdu’llah ibn ‘Abbas (may Allah be 
well pleased with him and with his father): “They said to Abu Bakr 
(may Allah be well pleased with him): ‘What will you say to your Lord, 
when you meet Him tomorrow [at the Resurrection], knowing that you 
designated ‘Umar to succeed you as our Caliph, even though you were 
fully aware of his coarse severity?’ But he replied: ‘I shall say: “I 
designated the best of Your people to succeed me as their Caliph.”’” 


On the Caliphate of ‘Uthman ibn ‘Affan 
(may Allah be well pleased with him). 


As for the Caliphate of ‘Uthman ibn ‘Affan (may Allah be well 
pleased with him), it also came about as a result of the unanimous 
agreement of the Companions (may Allah be well pleased with them 
all). The actual process was as follows: ‘Umar (may Allah be well 
pleased with him) had excluded his own sons from the succession to 
the Caliphate, which he left to be decided by a consultative council 
[shura] consisting of six individual members, namely Talha, az-Zubair, 
Sa‘d ibn Abi Waqqas, ‘Uthman, ‘Ali, and ‘Abd ar-Rahman ibn ‘Awf. 
‘Abd ar-Rahman said to ‘Ali and ‘Uthman: “I shall vote for one of 
you two, for the sake of Allah and His Messenger, and for the sake of all 
the believers.” 

Then he took ‘Ali (may Allah be well pleased with him) by the hand, 
and said to him: “O ‘Ali, let it be your responsibility to fulfill the 
contract [‘ahd] and compact [mithaq] of Allah, His covenant [dhimma] 
and the covenant of His Messenger. I hereby pledge you my allegiance, 
trusting that you will conduct yourself in all sincerity, for the sake of 
Allah, for the sake of His Messenger, and for the benefit of all the 
believers, and that you will follow in the footsteps of His Messenger, of 
Abu Bakr, and of ‘Umar.” But ‘Ali was afraid that he might not 
be strong enough to cope with the negative criticism to which 
he would be exposed, so he declined to accept the offer. ‘Abd 
ar-Rahman then took ‘Uthman (may Allah be well pleased with 
him) by the hand, and said to him what he had just said to ‘Ali. 
‘Uthman gave him a positive response, so he stroked the hand of 
‘Uthman and pledged his allegiance to him. ‘Ali (may Allah be 
well pleased with him) pledged his allegiance to him also. Then the 
rest of the people all pledged their allegiance.

Thus it was that ‘Uthman ibn ‘Affan came to be the Caliph of all the
people, by their unanimous agreement. He was a rightful leader [imam]
until the day he died. He was never guilty of anything serious enough
to justify the attempts to impeach him, the accusations of moral
depravity, and his eventual assassination, contrary to the claims made
by the Rawafid298 (may they be doomed to perdition!).




298 According to some authorities, the Rawafid [“Deserters”] are so called because they refused to
give their allegiance to Abu Bakr and ‘Umar. Others say that the name was first given to a sect
of the Shi‘a in Kufa, because they deserted their leader, Zaid ibn ‘Ali ibn Husain ibn ‘Ali ibn Abi
Talib, when he refused to accede to their demands that he should express abusive condemnation
of Abu Bakr and ‘Umar. As used by Sunni Muslims, the term ar-Rawafid (and its synonym
ar-Rafida) is commonly applied to the Shi‘a as a whole.


On the Caliphate of ‘Ali ibn Abi Talib
(may Allah be well pleased with him). 


As for the Caliphate of ‘Ali (may Allah be well pleased with him),
it came about as a result of the general agreement of the community,
and by the consensus of the Companions.

This view of the matter is borne out by the traditional report of Abu
‘Abdi’llah ibn Batta, who attributes the following account to Muhammad
ibn al-Hanafiyya:

“I was in the company of ‘Ali ibn Abi Talib while ‘Uthman ibn
‘Affan was under siege. Then a man came along and told us: ‘The
Commander of the Believers [Amir al-Mu’minin] was killed just a little
while ago.’ ‘Ali (may Allah be well pleased with him) sprang to his feet
at once, so I grabbed him and held him by the waist, for fear that he
might do something rash, but he cried: ‘Let go of me, you motherless
wretch!’ Then ‘Ali went to the palace, where he found that ‘Uthman
had indeed been slain, so he made his way to his own house, went inside,
and locked the door.

“The people came after him, and started hammering on his door. As
soon as they were admitted inside, they said: ‘‘Uthman has been killed,
and the people cannot manage without a Caliph. There is no one, as
far as we know, who is better qualified for the job than you are.’ But ‘Ali
responded to this by saying to them: ‘You do not really want me, for I
can serve you better as a minister [wazir] than as a commander [amir].’

“Still the people insisted: ‘We know of no one who is better qualified
for the job than you are.’ So he said (may Allah be well pleased with
him): ‘Very well, if you insist on leaving me no choice. In any case, the
fact that homage has been paid to me will not be a secret for long, but
let me go out to the mosque [masjid], so that all those who wish to pledge
their allegiance to me may do so there.’ He then left his house (may
Allah be well pleased with him) and went to the mosque, where the
people came and pledged their allegiance to him.”

bless him and give him peace), comparable to the miracles of the staff
and the restoration of the dead to life, in the respective cases of Moses 
and Jesus (peace be upon them both). 

Other miracles performed by the Prophet (Allah bless him and give 
him peace) include: 

1. The gushing forth of water from between his fingers.284

2. Feeding a multitude of people with a tiny supply of food.285

3. The speaking of the poisoned arm, and its uttering the words: “Do
not eat from me, for I have been poisoned!”

4. The splitting of the moon [inshiqaq al-qamar].286

5. The plaintive cry of the palm tree stump [hanin al-jidh‘].287

6. The speaking of the camel.

7. The movement of the tree toward him.288


Of course, these are but a few of the miracles ascribed to him (Allah 
bless him and give him peace), since their total number is said to be as 
high as one thousand. 

There are two good reasons to explain why the Prophet (Allah bless 
him and give him peace) did not reproduce such earlier miracles as the 
staff of Moses and his white hand, the revival of the dead and the 
healing of the leper and the man who had been blind from birth 
[performed by Jesus], the she-camel of Salih,289 or the particular miracles
assigned to the rest of the Prophets [anbiya’]. One of these two reasons 
is that it was necessary to ensure that his Community [umma] would not 


284 The Prophet (Allah bless him and give him peace) is said to have produced this miraculous
supply of water to quench the thirst of his Companions (may Allah be well pleased with them all)
during the encounter with the unbelievers of Quraish that resulted in the treaty of al-Hudaibiyya.

285 According to some traditional reports, the Prophet (Allah bless him and give him peace) once
fed a thousand people with the meat of one kid and a single measure [sa‘] of barley.

286 According to some traditional authorities, this miracle is referred to in the verses of the Qur’an
(54:1,2) which read: “The hour has drawn nigh, and the moon has split in two. Yet if they see
a miraculous sign [aya] they turn away, and they say: ‘[It is nothing more than] a prolonged trick
of sorcery [sihrun mustamirr].’”

At this point in his famous commentary [tafsir] on the Qur’an, al-Baidawi says: “Some say that
the unbelievers demanded this sign of the Prophet (Allah bless him and give him peace), and the
moon was split in two; but others say it refers to a sign of the coming Resurrection.”

287 This stump is said to have wept to such an extent that it almost split in two, because it was so
disappointed when the Prophet (Allah bless him and give him peace) refrained from leaning
against it.

288 Tradition tells that the Prophet (Allah bless him and give him peace) once fell asleep beneath
a tree, which then moved with the sun in order to keep him in the shade. According to another
traditional report, two trees miraculously moved to provide him with shade.

daily prayers, and he would utter the Call to Prayer in his presence
[yu’adhdhinuhu bi’s-salat]. The Prophet (Allah bless him and give him
peace) would then say: “Go and tell Abu Bakr that he must lead the
people in prayer.”

While he was still alive and well, the Prophet (Allah bless him and
give him peace) would often speak on the subject of Abu Bakr (may
Allah be well pleased with him), in such terms that it became quite
obvious to all the Companions that he must be the person best qualified
to assume the Caliphate when the Prophet (Allah bless him and give
him peace) was no longer with them. It was made equally obvious, in
respect of ‘Umar, ‘Uthman and ‘Ali (may Allah be well pleased with
them), that each of them would be the one best qualified to assume that
office in his own age and time.

Among other evidence to this effect, we have the traditional report
of Ibn Batta, complete with its chain of transmission [isnad] from ‘Ali
(may Allah be well pleased with him), who said: “Someone asked:
‘O Messenger of Allah, whom shall we appoint to be our leader, when
you are no longer with us?’ He replied (Allah bless him and give him
peace) by saying:

If you appoint Abu Bakr to be your leader, you will find him trustworthy,
abstemious in relation to this world, and enthusiastic in relation to the
hereafter. If you appoint ‘Umar to be your leader, you will find him strong and
trustworthy, someone who is not afraid, when the interest of Allah is at stake,
of criticism from any critic. If you put ‘Ali in charge, you will find him a good
guide, who is himself rightly guided.

“Because of this, they agreed unanimously that the Caliphate should
go to Abu Bakr.”

According to another traditional report, handed down to us on the
authority of our own Imam, Abu ‘Abdi’llah Ahmad ibn Hanbal (may
Allah bestow His mercy upon him), the validity of the Caliphate of Abu
Bakr (may Allah be well pleased with him) has been established beyond
any doubt, by clear and explicit evidence as well as by implication.

This also happens to be the doctrine [madhhab] of al-Hasan al-Basri
and a significant group of experts in the tradition [ashab al-hadith]. One
piece of evidence to support their view is provided by Abu Huraira (may


above the thirty, they must belong to the Caliphate of Mu‘awiya, which 
lasted for nineteen years and a few months all told, because the first 
period of thirty years was completed by “Ali (may Allah be well pleased 
with him), as we have already explained 




2 See p. 255 above. 


On the special respect due to the wives of the 
Prophet (Allah bless him and give him peace), 
who are regarded as the Mothers of the 
Believers, and to his daughter Fatima 
(may Allah be well pleased with her). 


We hold a good opinion of all the wives of the Prophet (Allah
bless him and give him peace). We firmly believe that they are
the Mothers of the Believers [Ummahat al-Mu’minin], and that ‘A’isha
(may Allah be well pleased with her) is one of the most excellent
women in the entire universe. Allah (Exalted is He) has declared her
completely innocent of the charges brought against her by the renegades,
as we read [in the Qur’an]303 and as people will go on reading until the
Day of Judgment [Yawm ad-Din].

It is likewise true of Fatima, the daughter of our Prophet (Allah bless 
him and give him peace, and may Allah the Exalted be well pleased with 
her, and with her husband and her children), that she is one of the most 
excellent women in the entire universe. It is incumbent upon us to 
regard her with loving care and affection, just as this is incumbent upon 
us with respect to her father (Allah bless him and give him peace). 

As the Prophet (Allah bless him and give him peace) has told us: 


Fatima is a piece of me. I am troubled by anything that troubles her.304


303 “They who spread the slander are a gang among you...,” (24:11) and: “They are liars in the
sight of Allah.” (24:13)


304 Fatimatu bid‘atun minni—yuribuni ma yuribuha.

no capital letters, and the letters of each word are most often closely
linked together, in print as well as in handwriting. The name 
“Muhammad” is therefore spelled mhmd, and the whole name in Arabic 
takes up little more space than the initial “M” of the transliterated form. 

Due to significant differences in word formation, grammar and 
syntax, the typical Arabic sentence is more concise than its English 
counterpart. Let me offer just one rather striking example: The single 
Arabic word kabbir (simply kbr in writing or print) means: “Proclaim 
the Supreme Greatness of the One Almighty God.” It is true that “Say, 
‘Allahu Akbar’” could qualify as a translation of sorts—despite being 
two-thirds Arabic—but even this shorter expression is still about five 
times the length of kbr. 

• In the Damascus edition of al-Ghunya, on which the present
translation is almost entirely based, the editor has supplied about a 
dozen footnotes, all of them quite brief. The contrast here is particularly 
stark, since the translator has provided hundreds of footnotes, many of 
them quite lengthy. 

Enough said, I trust, to explain why Sufficient Provision has been
published in several volumes. Attention to the subject matter has 
resulted in this particular five-volume set, following dividing lines 
apparent in the structure and contents of the work. 

Of the points that remain to be clarified, the most important concerns 
the editorial treatment of Volume One, where certain subsections have 
been assigned to the Appendices. As for the material presented in 
Appendix 1, this consists of selections from the Book of Good Manners 
[Kitab al-Adab], most of them relating quite specifically to physical 
situations and cultural conditions that are likely to be remote from the 
everyday experience of our readers. In the case of Appendix 2, the 
material was actually classed as supplementary by the author himself, 
when he appended his account of the heretical sects to Chapter Four. 

Regarding the Chapter headings, these are unnumbered in the 
Arabic text, so the numbers One through Seventeen have been 
supplied for convenience. In the Damascus edition, the important 
section on Marriage6 is not designated as a Chapter, but it has been
labeled Chapter Two in this translation, since the author subsequently
refers to it as a “chapter [bab].”7
6 Vol. 1, pp. 112-50.
7 Vol. 5, p. 75.

• Hard against the unbelievers. [This is an allusion to] ‘Umar ibn
al-Khattab.

• Merciful one to another. [This is an allusion to] ‘Uthman ibn
‘Affan.

• You see them bowing and falling prostrate [in worship]. [This
is an allusion to] ‘Ali ibn Abi Talib.

• Seeking bounty from Allah and good pleasure. [This is an allusion
to] Talha and az-Zubair, the two disciples [hawariya]307 of Allah’s
Messenger (Allah bless him and give him peace).

• Their mark is on their faces, the trace of prostration. [This is an
allusion to] Sa‘d [ibn Abi Waqqas], Sa‘id [ibn Zaid], ‘Abd ar-Rahman
ibn ‘Awf, and Abu ‘Ubaida ibn al-Jarrah. These make up the ten [who
received glad tidings of the promise of Paradise].308

• Such is their likeness in the Torah and their likeness in the
Gospel—as a seed that puts forth its shoot. This is a reference to
Muhammad (Allah bless him and give him peace).

• And strengthens it by means of Abu Bakr.

• And it grows stout by means of ‘Umar.

• And it rises straight upon its stalk by means of ‘Uthman.

• Pleasing the sowers by means of ‘Ali ibn Abi Talib.

• That through them—through the Prophet (Allah bless him and
give him peace)—He may enrage the unbelievers.


307 As explained in note ® on p. 83 above, the term hawari is most often applied to a disciple of
Jesus (peace be upon him).
308 See note 293 on p. 255 above.


On the correctness of adopting a neutral attitude 
toward the conflicts that arose among the 
Companions (may Allah be well pleased 
with them all). 


Those who remain faithful to the orthodox tradition of Islam [ahl
as-Sunna] are in agreement concerning the necessity of suspending
judgment on the issues over which the Companions (may Allah be well
pleased with them all) came into conflict with one another, of adopting
a neutral attitude toward their faults and failings, and of proclaiming
their virtues and good qualities.

As we have already explained, whatever may have been the causes 
and consequences of the discord between ‘Ali, on the one hand, and
Talha, az-Zubair, ‘A’isha and Mu‘awiya on the other (may Allah be
well pleased with them all), it is now agreed that the verdict on their 
conduct should be left to Allah (Almighty and Glorious is He), and that 
each should be given credit where credit is due. 

As Allah (Almighty and Glorious is He) has said: 

And as for those who came after them, they say: “Our Lord, forgive us and our
brothers who were before us in the faith, and do not lodge in our hearts any
rancor toward those who believe. Our Lord, You are the All-Gentle [ar-Ra’uf],
the All-Compassionate [ar-Rahim].” (59:10)


That is a community [umma] that has passed away. To their credit is that which
they have earned, and to your credit is that which you have earned. And you
will not be questioned concerning the things they used to do. (2:134 and 2:141)


The Prophet (Allah bless him and give him peace) is reported as 
having said: 
When things are said about my Companions, refrain from passing judgment. 
In another version of this traditional report, the words attributed to 
him (Allah bless him and give him peace) are:
Beware of [passing judgment on] the conflicts that flare up among my Companions,
for, even if one of you were to contribute his own weight in gold, it would
not equal the measure of one of them, nor would it amount to half as much.


He also said (Allah bless him and give him peace): 
Good betide anyone who has seen me, and anyone who has seen someone who 
has seen me!309
Do not revile my Companions. If anyone does revile them, may the curse of 
Allah be upon him! 

According to the traditional account [riwaya] of Anas ibn Malik (may
Allah be well pleased with him), he also said (Allah bless him and give 
him peace): 

Allah (Almighty and Glorious is He) has chosen me, and He has chosen my
Companions for me, for He has made them my helpers [ansari] and He has made
them my relatives by marriage [ashari]. At the end of the age, there will emerge
a set of people who will try to belittle them.


On no account should you eat in the company of such people. On no account 
should you drink in their company. On no account should you marry into their 
families. On no account should you perform the ritual prayers together with 
them. On no account should you pray for them. May the curse [of Allah] alight 
upon them! 
As reported by Jabir (may Allah be well pleased with him), Allah's 
Messenger (Allah bless him and give him peace) once said: 
Not one of those who made the pledge of allegiance beneath the tree310 will
enter the Fire of Hell.
As reported by Abu Huraira (may Allah be well pleased with him),
Allah’s Messenger (Allah bless him and give him peace) also said:
Allah beheld the people [who fought in the battle] of Badr, and He said: “O
people of Badr, do whatever you will, for I have already forgiven you!”
Ibn ‘Umar (may Allah be well pleased with him and with his father)
reported that Allah’s Messenger (Allah bless him and give him peace) 
once said: 


My Companions are just like the stars, so take the word of any one of them and 
you will be guided aright. 


309 Tuba li-man ra’ani—wa li-man ra’a man ra’ani.

310 The pledge of allegiance known as Bai‘at ar-Ridwan, which was made beneath a tree at
al-Hudaibiyya in the sixth year of the Hijra. (See p. 254 above.)


According to yet another traditional report, this one transmitted by
Ibn Buraida on the authority of his father (may Allah be well pleased
with him), the Prophet (Allah bless him and give him peace) once said:


If one of my Companions should happen to die in a particular country, he will 
be appointed as an intercessor [shafi‘] on behalf of the people of that country.


Sufyan ibn ‘Uyaina (may Allah bestow His mercy upon him) has said:
“If anyone has a bad word to say about the Companions of the Prophet 
(Allah bless him and give him peace), he must be the follower of an 
heretical sect.” 




On the duty to obey the leaders of the Muslim 
community, and other matters on which 
a consensus has been reached among 
followers of the orthodox Islamic tradition. 


Those who remain faithful to the orthodox tradition of Islam [ahl
as-Sunna] are unanimously agreed that it is necessary to
obey311 the leaders [a’imma]312 of the Muslims, to follow their instructions,
and to perform the ritual prayer [salat] behind each and every one of 
them, whether he be pious or immoral, and whether he be just or unjust. 
The same obedience is due to all those who have been put in charge, 
or appointed to official posts, or had authority delegated to them by 
the leaders. 

It is not proper to assert that any member of the Muslim community313
is sure to end up in the Garden of Paradise or in the Fire of Hell, whether
he be an obedient worshipper or a disobedient sinner, whether he be 
mature and reasonable or reckless and unstable, unless he is demonstrably 
guilty of heretical innovation [bid‘a] and of straying from the path of truth. 

They are unanimously agreed on the assigning of major miracles
[mu‘jizat] exclusively to the Prophets [anbiya’], and of lesser miracles or
charismatic gifts [karamat] to the saints [awliya’].

They also agree that economic cycles of inflation and deflation 
[al-ghala’ wa'r-rukhs] must be attributed to the workings of Allah, and 
are not subject to the control of any of his creatures, such as the sultans 
and kings, nor to the influence of the stars and planets, as the 
Qadariyya314 and the Mujassimun315 would have us believe.

311 Literally, to hear and to obey [as-sam‘ wa’t-ta‘a].

312 Plural of imam.
313 Literally, the people who pray toward the Ka‘ba in Mecca [ahl al-Qibla].
314 See note 2” on p. 214 above.
315 The Corporealists, so called because they ascribe a physical body [jism] to Allah.

The correct view of this matter is based on the traditional report of
Anas ibn Malik (may Allah be well pleased with him), according to whom
Allah’s Messenger (Allah bless him and give him peace) once said:
The market forces of inflation and deflation [al-ghala’ wa’r-rukhs] are two units
from the armies of Allah. The name of one of them is scarceness [raghba], and
the name of the other is scariness [rahba]. Whenever Allah wishes to inflate
the price of a particular commodity, He instills a sense of scarceness in the hearts
of the merchants, so they hoard that commodity. But when He wishes to deflate
the price of an article, He instills a sense of scariness in the breasts of the traders,
so they rush to get it off their hands.

The best course for the shrewd and intelligent believer to take is to 
follow the traditional path, and not to indulge in heretical innovation 
[an yattabi ‘a wa la yabtadi‘a]. He should beware of going to extremes, 
of getting too far out of his depth, and of taking too much upon himself, 
in case he loses his way, makes a serious slip, and meets his downfall. 

As ‘Abdu’llah ibn Mas‘ud (may Allah be well pleased with him)
used to say: “Follow the traditional path and do not indulge in heretical 
innovation, then you will stay out of trouble.” 

Mu‘adh ibn Jabal (may Allah be well pleased with him) once said: 
“Beware of probing into matters that are very obscure, and of saying 
‘What can this be?’ to everything you come across.”

When Mujahid (may Allah bestow His mercy upon him) came to 
hear these words of Mu‘adh’s, he remarked: “We had always been 
in the habit of saying ‘What can this be?’ to everything we came 
across, but not anymore!” 

It is incumbent upon the believer to follow the Sunna and the Jama‘a.
The Sunna is the exemplary precedent [ma sanna] set by Allah’s 
Messenger (Allah bless him and give him peace), while the Jama‘a is 
the common practice agreed upon by the Companions of Allah’s 
Messenger (Allah bless him and give him peace) during the Caliphate 
of the Four Imams, the rightly guided Caliphs [Abu Bakr, ‘Umar,
‘Uthman and ‘Ali] (may Allah be well pleased with them all).

The believer ought not to emulate the heretical innovators [ahl al-bida‘], 
nor should he give them any credit. He should not even salute them 
with the Islamic greeting, because our Imam, Ahmad ibn Hanbal (may 
Allah bestow His mercy upon him) has said: “If someone gives the 
Islamic greeting to a person who is guilty of heretical innovation [sahib
bid‘a], he is likely to become fond of that person, in view of the saying


On the good manners to be observed when traveling 
[adab as-safar] and on how to relate to fellow travelers. 


When a person is proposing to set out on a journey, a Pilgrimage
[Hajj] or a military expedition [ghazw], or to move from one
house to another, or to go in pursuit of something he needs, he should
perform a ritual prayer of two cycles [yusalli rak‘atain], then embark on
his quest or undertake his move.

In the case of a journey, he should say the following words after
completing the two cycles of prayer:


O Allah,
convey a communication
that conveys a blessing
and forgiveness from You,
and a sign of approval.
All goodness is in Your hand,
and You are Powerful
over all things.


O Allah, You are the
Companion on the journey,
and the Deputy in charge of
the wife, the property and
the children (left at home).


O Allah, make the journey
a smooth one for us,
and make the distance
seem short to us.


O Allah, I take refuge with
You from hardship
on the journey,
and from trouble
on the way home,
and from finding that things
look bad where the wife,
the children and the
property are concerned.


Anta’s-Sahibu fi’s-safar
wa’l-Khalifatu
fi’l-ahli wa’l-mali
wa’l-wuld.


a‘udhu bika
min wa‘tha’i’s-safar
wa ka’abati’l-munqalab
wa su’i’l-manzari
fi’l-ahli
wa’l-wuldi
wa’l-mal.

should beseech Allah (Exalted is He) to forgive his sins, even if his
good works are few and far between. And if you happen to catch sight 
of an heretical innovator [mubtadi’] on the road, you had better take a 
different road!” 

Fudail ibn ‘Iyad (may Allah bestow His mercy upon him) has also
said: “I once heard Sufyan ibn ‘Uyaina (may Allah bestow His mercy
upon him) saying: ‘If someone follows the funeral procession [janaza]
of an heretical innovator, he will not cease to be exposed to the displeasure
of Allah (Exalted is He), until he turns around and leaves it.’”

The Prophet himself (Allah bless him and give him peace) has cursed the 
heretical innovator, for he has said (Allah bless him and give him peace): 

If a person originates a mischievous innovation [ahdatha hadathan], or offers 
hospitality to a mischief-making innovator [muhdith], may he be exposed to the 
curse of Allah, of the angels, and of all mankind, and may Allah refuse to accept 
from him both the full load [sirf] and the half load [‘idl].

What he means by ‘the full load’ is the performance of obligatory
religious duty [farida], and by ‘the half load’ he means the supererogatory
act of worship [nafila].

Abu Ayyub as-Sijistani (may Allah bestow His mercy upon him) is
reported as having said: 

“If a man is told about the Sunna, and he responds by saying: ‘Never
mind about that! Talk to us about what is in the Qur’an,’ you can be
sure that he has gone astray.”


On the distinctive features of the various
sects responsible for the introduction of
heretical innovations [ahl al-bida‘].


You need to be aware that the various sects responsible for the
introduction of heretical innovations [ahl al-bida‘] have
distinctive features by which they can be recognized.

The distinctive feature of the proponents of a particular heretical 
innovation can be identified by noting the term of contempt by 
which they choose to misrepresent the loyal followers of the 
Tradition [ahl al-Athar]. 

In the case of the Freethinkers [Zanadiqa],316 it is the fact that they
refer to the followers of the Tradition as the “Trash Collectors”
[Hashwiyya],317 and that they would like to see the traditions [athar]
dismissed as completely worthless.

The distinctive feature of the Qadariyya318 is the fact that they refer
to the followers of the Tradition as “Compulsives” [Mujbara].319

In the case of the Jahmiyya320 it is the fact that they refer to the followers
of the Tradition321 as “Anthropomorphists” [Mushabbiha].322


316 Some authorities derive the word Zindiq (of which Zanadiqa is the plural form) from the Persian
Zand or Zend, meaning the commentary on the book of Zoroaster. The Arabic lexicographers give
a wide range of meanings for Zindiq, including: “Dualist; atheist; one who believes in the eternity
of this world; a follower of Mani or Mazdak; one who does not follow any religion; a freethinker.”
(See: E.W. Lane, Arabic-English Lexicon, art. 5 agp.) According to L. Massignon: “...the
polemics of the conservatives describe as a zindiq...any one whose external profession of Islam
seems to them not sufficiently sincere.” (See: SEI, art. ZINDIK.)

317 According to the anonymous author of the article HASHWIYA in SEI: “Used as a
contemptuous term for those among the men of Tradition (ashab al-hadith) who recognized 
coarsely anthropomorphic traditions as genuine, without criticism and even with a re pe of 
preference, and interpreted them literally.” 

318 On the Qadariyya, see note 2” on p. 214 above. 

319 The Mujbara (also known as the Jabariyya) are a sect who condone “compulsive” behavior, 
since they assert that man has no power ni tndacs to control his own actions, and that Allah
compels His servants to commit sins. 

320 On the Jahmiyya, see note !*5 on p. 183 above. 

321 In this particular sentence, the author uses the term ahl as-Sunna, rather than ahl al-Athar, which
is otherwise repeated throughout the paragraph. The terms are clearly synonymous in this context.


322 On the Anthropomorphists [Mushabbiha], see note '7! on page 178 above.

no bigger than the tip of a man’s thumb will then be produced. (As he was saying
this, the Prophet squeezed half of his own thumb between his fingers.)276


It will contain the declaration of faith [shahada], testifying that there is no god 
but Allah, and that Muhammad is the Messenger of Allah [an la ilaha illa’llah wa
anna Muhammadan Rasulu’llah]. This will be placed in the scale of his good
deeds, causing his good deeds to weigh heavier than his evil deeds, so the order
will then be given for him to be led off to the Garden of Paradise.277

It has been said that the weights [sanj] to be used on that Day [of 
Resurrection] will be those that are now used to measure tiny grains and 
mustard seeds. 

Good deeds will take the form of some beautiful object, which will be 
cast into the scale of radiant light. The Balance will then record the 
weight of it as heavy, because of the mercy [rahma] of Allah. Evil deeds 
will take the form of some foul object, which will be cast into the scale 
of darkness. The Balance will then register the weight of it as light, 
because of the justice [‘adl] of Allah (Exalted is He).

The Balance will indicate the presence of a heavy weight by the 
upward motion of the scale, and of a light weight by the downward 
motion of the scale, contrary to the balances of this world. What causes 
it to register a heavy weight will be faith [iman] and the utterance of the
twofold declaration of belief [qawl ash-shahadatain], and what causes it
to register a light weight will be the ascription of partners [shirk] to Allah
(Almighty and Glorious is He). If the scale moves upward, the person 
concerned will be admitted to the Garden of Paradise, because it is up 
on high. But if it registers a light weight, the person concerned will be 
made to enter the bottomless pit of the Fire of Hell, because it is in the 
region of the lowest of the low. As Allah (Almighty and Glorious is He) 
has said: 

Then he whose deeds weigh heavy in the scales shall inherit a pleasing life, but 
he whose deeds weigh light in the scales, his mother shall be the Pit. (101:6-9) 


In other words, the former shall dwell in a Garden of Paradise on high, 


276 This sentence represents a parenthetic observation by the narrator of the tradition [hadith],
interjected to give graphic effect to the words of the Prophet (Allah bless him and give him peace).
277 Author’s note: In a slightly different version, the wording is: “A scrap of paper [qirtas] no bigger
than this—the Prophet squeezed his thumb to demonstrate—will be produced on his behalf. It
will contain the declaration of faith [shahada], testifying that there is no god but Allah, and that
Muhammad is the Messenger of Allah [an la ilaha illa’llah—wa anna Muhammadan Rasulu’llah]....”

The rest of the tradition (hadith) is narrated in the same words as the version quoted in the text. 


On the Caliphate of ‘Umar ibn al-Khattab
(may Allah be well pleased with him). 


As for the Caliphate of ‘Umar (may Allah be well pleased with
him), he was designated to succeed to it by Abu Bakr himself (may
Allah be well pleased with him), so all the Companions willingly agreed
to pledge their allegiance to him, and to call him by the title “Commander
of the Believers [Amir al-Mu’minin].”

According to the statement of ‘Abdu’llah ibn ‘Abbas (may Allah be
well pleased with him and with his father): “They said to Abu Bakr
(may Allah be well pleased with him): ‘What will you say to your Lord,
when you meet Him tomorrow [at the Resurrection], knowing that you
designated ‘Umar to succeed you as our Caliph, even though you were
fully aware of his coarse severity?’ But he replied: ‘I shall say: “I
designated the best of Your people to succeed me as their Caliph.”’”


should beseech Allah (Exalted is He) to forgive his sins, even if his
good works are few and far between. And if you happen to catch sight 
of an heretical innovator [mubtadi’] on the road, you had better take a 
different road!” 

Fudail ibn ‘Iyad (may Allah bestow His mercy upon him) has also
said: “I once heard Sufyan ibn ‘Uyaina (may Allah bestow His mercy
upon him) saying: ‘If someone follows the funeral procession [janaza]
of an heretical innovator, he will not cease to be exposed to the displeasure
of Allah (Exalted is He), until he turns around and leaves it.’”

The Prophet himself (Allah bless him and give him peace) has cursed the 
heretical innovator, for he has said (Allah bless him and give him peace): 

If a person originates a mischievous innovation [ahdatha hadathan], or offers 
hospitality to a mischief-making innovator [muhdith], may he be exposed to the 
curse of Allah, of the angels, and of all mankind, and may Allah refuse to accept 
from him both the full load [sirf] and the half load [‘idl].

What he means by ‘the full load’ is the performance of obligatory
religious duty [farida], and by ‘the half load’ he means the supererogatory 
act of worship [nafila]. 

Abu Ayyub as-Sijistani (may Allah bestow His mercy upon him) is
reported as having said: 

“If a man is told about the Sunna, and he responds by saying: ‘Never
mind about that! Talk to us about what is in the Qur’an,’ you can be
sure that he has gone astray.”


On the good manners to be observed when traveling 
[adab as-safar] and on how to relate to fellow travelers. 


When a person is proposing to set out on a journey, a Pilgrimage
[Hajj] or a military expedition [ghazw], or to move from one
house to another, or to go in pursuit of something he needs, he should
perform a ritual prayer of two cycles [yusalli rak‘atain], then embark on
his quest or undertake his move.


In the case of a journey, he should say the following words after
completing the two cycles of prayer:


O Allah, 

convey a communication 
that conveys a blessing 

and forgiveness from You, 
and a sign of approval. 

All goodness is in Your hand, 
and You are Powerful 

over all things. 


O Allah, You are the 
Companion on the journey, 
and the Deputy in charge of 
the wife, the property and 
the children (left at home). 


O Allah, make the journey 
a smooth one for us, 

and make the distance 
seem short to us. 


O Allah, I take refuge with
You from hardship 

on the journey, 

and from trouble 

on the way home, 

and from finding that things 
look bad where the wife, 
the children and the 
property are concerned. 


Anta’s-Sahibu fi’s-safar
wa’l-Khalifatu
fi’l-ahli wa’l-mali
wa’l-wuld.


a‘udhu bika
min wa‘tha’i’s-safar
wa ka’abati’l-munqalab
wa su’i’l-manzari
fi’l-ahli
wa’l-wuldi
wa’l-mal.
‘Atiq: (1) Outstripping. (2) Admirably generous or excellent.
(3) Fine, delicate, comely. (4) Emancipated, freed from
slavery. (5) Old, ancient.334 (6) Wine.

Faqih: (1) Possessing knowledge and understanding. (2) A legist,
an expert in the canonical jurisprudence [fiqh] of Islam.

Fahim: Good at comprehending, shrewd, sensible, discerning, 
judicious, intelligent. 

Fatin: Clever, smart, astute, sagacious, perspicacious, bright, 
intelligent. 

Muhaqqiq: (1) Scrutinizing, meticulous, careful to ascertain the true
facts. (2) A literary editor and critic.

‘Aqil: (1) Intelligent, reasonable, understanding, sensible, rational,
discerning, prudent, judicious, wise. (2) In full possession
of one’s mental faculties, compos mentis, sane in
mind. (3) One who pays the bloodwit to the heirs of a
person who has been killed unintentionally.

Muwaqqar: (1) Venerable, reverend. (2) Heavily laden.

Tayyib: (1) Good, pleasant, agreeable. (2) Well-disposed, friendly, 
kindly. (3) Delicious, tasty. 

There are some, however, who maintain that Tayyib is permissible as
an epithet for Allah (Almighty and Glorious is He).


332 Many of the Arabic words in this list have several possible meanings, or at least several distinctly
significant shades of meaning. This makes it rather difficult for the translator to offer a single word,
or even a short phrase, as a reasonably exact English equivalent of each Arabic term. In some cases,
the author himself (may Allah be well pleased with him) has drawn attention to an ambiguity, and
has added a helpful note of clarification. For a satisfactory understanding of this passage, however, 
it seems necessary to provide a full range of possible meanings and nuances for each of the Arabic 
terms listed here as inappropriate epithets for Allah (Almighty and Glorious is He). 

333 In addition to his more commonly known appellation as-Siddiq [the Champion of Truth], Abu
Bakr (may Allah be well pleased with him) was also given the surname al-‘Atiq. According to one
traditional account, this was in recognition of his having been told by the Prophet (Allah bless
him and give him peace) that he was emancipated [‘atiq] from the Fire of Hell and assured of a place
in the Garden of Paradise. 


334 The adjective ‘atiq is interpreted in this sense when it occurs in the Qur’an as an
appellation of the Ka‘ba, because it was the first house founded upon the earth:


The first House established for mankind was that at Becca [= Mecca], a blessed place, and a
guidance to all beings. (3:96)

Let them then finish with their self-neglect and let them fulfill their vows, and let them go
around the Ancient House [al-Bait al-‘Atiq]. (22:29)


There are things therein profitable to you unto stated term; and afterward their lawful place
of sacrifice is by the Ancient House [al-Bait al-‘Atiq]. (22:33)
‘Adiyy: Old, ancient.

This cannot be appropriate, because it is the adjective applied to the
era of the [ancient and extinct] tribe of ‘Ad, which was brought into
existence at a certain point in time [muhdath].

Mutiq: Competent, capable, possessing a faculty or capability
[taqa].

This cannot be applied to Allah (Almighty and Glorious is He),
because He is the Creator of every faculty or capability [taqa], and all
faculties or capabilities are finite [mutanahiya].

Mahfuz: (1) Kept safe, preserved, treasured, taken care of.
(2) Committed to memory.

This cannot be appropriate as an epithet for Allah (Almighty and
Glorious is He), because He is the Preserver [al-Hafiz].335

Sexual intercourse [mubashara] cannot be attributed to Him.336

It is not permissible to describe Him as muktasib [acquisitive],337 because
an acquisition is brought into existence at a given point in time 
[muhdath], by means of a capacity for novel invention [qudra muhditha], 
and Allah (Exalted is He) is far above and beyond anything of this kind. 

Nonexistence [‘adam] is inconceivable with respect to Him, for He is
Infinitely Pre-Existent [Qadim la bi-qidam]. There is no starting point
to His existence [la awwal li-wujudih], contrary to the assertion made by
Ibn Kullab, to the effect that He is Pre-Existent, but not infinitely so
[Qadim bi-qidam].

He is also Infinitely Enduring [Baqin la bi-baqa’]. Infinite are the facts
of which He (Almighty and Glorious is He) has knowledge, and infinite
are the destinies which He has the power to determine,338 contrary to
the doctrine propagated by the Mu‘tazila,339 who would have us believe
that these are finite.

335 Hafiz is the active participle of the root h-f-z, while mahfuz is the passive participle of that same
root. The latter is applied in the Qur’an (85:22) to the “Treasured Tablet” [Lawh Mahfuz], on
which the Divine decrees have been inscribed from all eternity.

336 While the term mubashara does usually refer specifically to “the [enjoyment of] contact with
a woman, skin to skin,” with the implication that this involves sexual intercourse, it is possible that
this statement is intended as a more general refutation of certain anthropomorphic doctrines, in
which case it should be rendered: “It is not permissible to attribute any form of physical contact
[mubashara] to Him.”


337 According to the Arabic lexicographers, the full sense of the term muktasib is: “One who applies
himself with skill and diligence to get [or obtain, or acquire, or gain, or earn] sustenance.” (See:
E.W. Lane, Arabic-English Lexicon, art. K-S-B.)


338 Huwa (‘azza wa jalla) ‘Alimun bi-ma‘lumatin ghairi mutanahiya—Qadirun bi-maqduratin ghairi
mutanahiya.


3 See note !™4 on p. 178 above. 
As for those attributes which it is permissible to ascribe to Him
(Almighty and Glorious is He), they include happiness [farah], laughter
[dahk], anger [ghadab], displeasure [sakhat] and contentment [rida]. (We
have already mentioned this at the beginning of the chapter.)

It must be permissible to refer to Him as “[there to be] Found”
[Mawjud],340 since He has said (Almighty and Glorious is He):

And as for those who disbelieve, their works are as a mirage in a spacious plain.
The thirsty man supposes it to be water, till he comes to it and finds it is nothing,
and instead of it he finds Allah [wajada’llaha ‘indahu]. (24:39)

It must also be permissible to refer to Him as “a Thing” [Shai’], in view
of His own words (Exalted is He):


Say: “What thing [ayyu shai’in] is greatest in testimony?” Say: “Allah.” (6:19)


He may properly be referred to as “a Person” or “an Individual” [Nafs
or Dhat or ‘Ain],341 without any anthropomorphic comparison being
implied, as we have previously explained.

He may be referred to as “Being” [Ka’in],342 without specific definition,
in view of His own words (Exalted is He):


And Allah is Aware of all things [kana’llahu bi-kulli shai’in ‘Alima]. (33:40)
And Allah is Watchful over all things [kana’llahu bi-kulli shai’in Raqiba]. (33:52)

It is permissible to characterize Him as all of the following:343


Qadim: Ancient [without beginning], Infinitely Pre-Existent,
Sempiternal.344
Baqin:345 Everlasting, Eternal.


340 As the passive participle of the verb wajada, the word mawjud means simply “found.” By
extension, it then comes to signify: (1) There to be found, available, on hand, existing, existent.
(2) Actually present. (3) A living being.

341 As used in non-specific contexts, these three Arabic words are virtual synonyms, and, like “a
person” or “an individual” in corresponding English usage, they mean little more than “someone.”

342 The Arabic word ka’in is the participle of the verb “to be” [kana], so it could also be translated
as “one who is.” Here, as in many instances in this section of his work, the author (may Allah be
well pleased with him) is making a point which defies straightforward translation into English,
since it is so inextricably bound up with the subtleties of Arabic word and sentence formation.

343 As in the case of the impermissible epithets listed previously (see pp. 281-84 above), many of
these Arabic terms have a wide range of possible meanings and nuances. In this case, however,
the translator has confined himself to supplying only the more directly relevant English
equivalents, since the author himself (may Allah be well pleased with him) has provided quite
extensive notes of explanation in most instances.

344 As an ordinary adjective, qadim means “old; ancient”—as in mal qadim [old, or long-possessed,
property]. As an epithet applied to Allah, al-Qadim is often reinforced by the addition of al-Azali
[Existing from all eternity].

345 Since the final -n is not a root consonant, the word assumes the form al-Baqi when the definite
article is prefixed to it.
no capital letters, and the letters of each word are most often closely
linked together, in print as well as in handwriting. The name 
“Muhammad” is therefore spelled mhmd, and the whole name in Arabic 
takes up little more space than the initial “M” of the transliterated form. 

Due to significant differences in word formation, grammar and 
syntax, the typical Arabic sentence is more concise than its English 
counterpart. Let me offer just one rather striking example: The single 
Arabic word kabbir (simply kbr in writing or print) means: “Proclaim 
the Supreme Greatness of the One Almighty God.” It is true that “Say, 
‘Allahu Akbar’” could qualify as a translation of sorts—despite being 
two-thirds Arabic—but even this shorter expression is still about five 
times the length of kbr. 

• In the Damascus edition of al-Ghunya, on which the present
translation is almost entirely based, the editor has supplied about a 
dozen footnotes, all of them quite brief. The contrast here is particularly 
stark, since the translator has provided hundreds of footnotes, many of 
them quite lengthy. 

Enough said, I trust, to explain why Sufficient Provision has been
published in several volumes. Attention to the subject matter has 
resulted in this particular five-volume set, following dividing lines 
apparent in the structure and contents of the work. 

Of the points that remain to be clarified, the most important concerns 
the editorial treatment of Volume One, where certain subsections have 
been assigned to the Appendices. As for the material presented in 
Appendix 1, this consists of selections from the Book of Good Manners 
[Kitab al-Adab], most of them relating quite specifically to physical 
situations and cultural conditions that are likely to be remote from the 
everyday experience of our readers. In the case of Appendix 2, the 
material was actually classed as supplementary by the author himself, 
when he appended his account of the heretical sects to Chapter Four. 

Regarding the Chapter headings, these are unnumbered in the 
Arabic text, so the numbers One through Seventeen have been 
supplied for convenience. In the Damascus edition, the important 
section on Marriage6 is not designated as a Chapter, but it has been
labeled Chapter Two in this translation, since the author subsequently
refers to it as a “chapter [bab].”7

6 Vol. 1, pp. 112-50.
7 Vol. 5, p. 75.
proverbial saying: “As you repay, so shall you be repaid [kama tadinu
tudan],” and Allah describes Himself in the Qur’an as “Master of the
Day of Din”351—that is, of the Day of Reckoning.

The term may also be applied to Him in the sense that He is the Lawgiver 
[Shari‘] for His servants, meaning that He has prescribed a form of religious 
worship [‘ibada] and a sacred law [shari‘a], that He has summoned them to
follow it as their obligatory duty, and that He will then recompense them 
according to how they act in fulfillment of that duty. 

Muqaddir: One who determines or decrees. (1) In the sense of
predetermination [taqdir], as in His words (Almighty and
Glorious is He):


Surely We have created everything in predetermined measure [bi-qadar].
(54:49)
Glorify the Name of your Lord the Most High, who created and shaped, who
determined [qaddara] and guided. (87:1-3)

(2) In the sense of information [khabar], as in His words
(Almighty and Glorious is He):

Except his wife, of whom We had decreed [qaddarna] that she should be among
those who stay behind. (15:60)

In other words: “We had informed [akhbarna] Lot (peace be upon 
him) that his wife was one of those who would remain behind in 
torment, while the rest of his family escaped with him.” 

When used in reference to Allah, the term must not convey the 
meaning of guesswork and doubtful speculation [which it may carry in 
some ordinary contexts].352 Exalted is Allah, far above and beyond
anything of the kind! 

Nazir: Observer. 

This is permissible in the sense that He is One who sees [Ra’in], who 
consciously perceives [Mudrik] things, but not when it is synonymous 
with mutarawwin [one who ponders and reflects] or mutafakkir [one who 
cogitates and meditates]. 

Shafiq: Kind, Compassionate.

It is permissible in this sense, when it refers to His mercy and gentle
kindness toward His creatures, but not when it bears implications of fear
and sadness.353

351 Maliki Yawmi’d-Din. (1:4)

352 The Arabic verb qaddara (of which taqdir is the verbal noun, and muqaddir the active participle)
has a wide range of possible meanings. In some contexts it signifies “to surmise, guess, assume,
presume, suppose...”

353 As an ordinary adjective, shafiq sometimes means “sympathetic, affectionate, tender, kind,” but
it may also mean “fearful, filled with unhappy misgivings.”
Rafiq: Gentle, Tender.

This is permissible only when intended to convey the sense of His
compassion and tender disposition toward His creatures, not if it would
be understood as referring to the efficient and convenient handling of
worldly affairs, and to ensuring immunity from their consequences.354

Sakhiyy: Liberal, Bountiful, Munificent, Generous. 

Karim: Generous, Liberal, Munificent, Beneficent. 

Jawad: Liberal, Generous, Bountiful, Munificent. 

Sakhiyy is permissible in this sense, as a synonym of Karim and Jawad, 
because the meaning common to all three terms is Allah’s bountiful 
favor [tafaddul] and beneficence [ihsan] toward His creatures. It must not
be intended, however, to convey the notion of softness [rakhawa] and
pliability [lin], which it may sometimes signify in ordinary linguistic usage,
as for instance in the expressions ard sakhiyya [a piece of earth that is soft 
and easy to work] and qirtas sakhiyy [a pliable sheet of paper]. 

Amir: Commanding; One who issues positive commandments. 

Nahin:356 Prohibiting; One who lays down negative commandments
or prohibitions.

Muharrim: Illegitimating; One who declares certain things, and
certain actions, to be illegitimate or unlawful [haram].

Farid: Obligating; One who imposes obligatory religious duties
[fara’id].

Mulzim: Enjoining; One who makes the fulfillment of certain
obligations compulsory.

Mujib: Necessitating; One who places His servants under the
necessity of performing certain duties.

Nadib: Recommending; One who urgently recommends certain 
forms of behavior, even if He has not prescribed them as 
absolutely essential. 


354 As an ordinary adjective, rafiq means: (1) Gentle, soft, tender, gracious, courteous, civil. (2)
Neat or skillful in work or operation, whence the expression hadha’l-amru rafiqun bika [“This
business is easy, or convenient, for you.”] (As a noun, rafiq means “a companion, [especially] a
traveling companion.”)

355 In these and similar expressions, as the author notes, the adjective sakhiyy (or sakhiyya, when
it is required to agree with a grammatically feminine noun, such as ard) is synonymous not with
karim and jawad, but with layyin, meaning “soft; flabby, feeble; flexible, pliable, yielding; pliant,
supple, resilient, elastic, tractable.”

356 Since the final -n is not a root consonant, the word assumes the form al-Nahi (pronounced
an-Nahi) when the definite article is prefixed to it.
Murshid: Director, Guide; One who shows His servants the right
path.

Qadin:357 Judge; One who preordains the fate [qada’] of His creatures.

Hakim: Ruler, Governor. (As we have already mentioned elsewhere
in this work.358)

Wa‘id: Promising; One who promises to reward His servants for
obedient conduct.

Mutawa‘‘id: Threatening, Menacing; One who threatens to punish
His servants for disobedient conduct.

Mukhawwif: Fear-inspiring; One who puts His servants in fear of
disobeying Him.

Muhadhdhir: Cautioning; One who warns His servants to beware of
disobeying Him.359

Dhamm: Blaming, Dispraising; One who gives blame where blame
is due.

Madih: Praising, Commending; One who gives praise where praise
is due.

Mukhatib: Conversing, Discoursing; One who may address a servant
of His in intelligible speech.

Mutakallim: Speaker; One who uses the spoken word.

Qa’il: Sayer; One who utters sayings in language intelligible
to His human creatures.

All of these [Mukhatib, Mutakallim and Qa’il] can be traced back to
the same concept, namely that the faculty of speech [kalam] is rightly
attributed to Him.

Mu‘dim: Annihilating; One who can cause things to cease to exist.
[In some contexts, however, the same word means
“deprived, destitute, reduced to nought, unmade.” This
passive meaning is relevant to the first sentence in the
following paragraph.]

It is permissible to characterize Him as Mu‘dim in the sense that He
has not been caused to exist [lam yujad] and has not been made [lam
yuf‘al]. The same epithet is also appropriate in the sense that He is One

357 Since the final -n is not a root consonant, the word assumes the form al-Qadi when the definite
article is prefixed to it.

358 See p. 172 above.


359 The verb yuhadhdhiru, of which muhadhdhir is the corresponding active participle, occurs twice
in the Qur’an:
Allah warns you to beware [yuhadhdhirukum] of Himself. (3:28 and 30)
who annihilates that which He has brought into existence
[ma awjadahu], after having caused it to exist [ba‘da ijadihi], by depriving
it of continued survival, with the result that it becomes nonexistent
[yan‘adim].

Fa‘il: Active; Effective; Doer; One who does things; One who
effectively brings things about.

It is permissible to characterize Him as such in the sense that He is 
One who originates the essence of whatever He does [Mukhtari‘un 
li-dhati ma fa‘ala], who creates it [Khaliqun lah], and who brings it into 
being through His power [Ja‘il bi-qudratih]. 

Only in this sense can the epithet properly be applied to Him, not in 
the sense of physical contact [mubashara] with things, because what this 
actually signifies is the mutual encounter and contiguity of material 
substances, and Allah (Glory be to Him) is Exalted far above and 
beyond anything of the kind! 

Ja‘il: One who brings things about; One who can cause a thing
to assume a certain form, or to serve a particular purpose.

It is permissible to characterize Him as such in the sense that He is
One who acts [Fa‘il], and whose action produces a result [fi‘luhu maf‘ul].
To quote His own words (Exalted is He):

And We have brought about [ja‘alna]360 the night and the day, to serve as two
signs. (17:12) 

The act of bringing about [ja‘l] may also convey the sense of deciding
by decree [hukm]. For instance, He has said (Almighty and Glorious is He):

We have decreed that it should be [ja‘alnahu] an Arabic Qur’an. (43:3)


Tarik: Canceller; One who may cancel out what He has done.
[In ordinary contexts, tarik often means “leaving, abandoning;
abstaining, refraining.”]

It is in fact permissible to characterize Him as One who may cancel 
things out, just as He has been characterized as One who effectively 
brings things about [Fa‘il], in the sense that He may do something that 
is the opposite of something else He has done, as a substitute for what 
He did in the first place [badalan min al-awwal], by means of His 
universal and all-embracing power. This epithet may not be applied to 
Him, however, if it could be understood to convey the [more ordinary] 


360 Ja‘il is the active participle corresponding to the verb ja‘ala, of which the form ja‘alna is used
when the grammatical subject is first person plural.
meaning of curbin

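Step 5 Eliminate low rewarding arms:

$$\mathcal{A}_{\ell+1} = \left\{ a \in \mathcal{A}_\ell : \max_{b \in \mathcal{A}_\ell} \langle \hat\theta_\ell, b - a \rangle \le 2\varepsilon_\ell \right\}.$$

Step 6 $\ell \leftarrow \ell + 1$ and Goto Step 1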


Algorithm 12: Phased elimination with G-optimal exploration. 


where $C > 0$ is a universal constant. If $\delta = O(1/n)$, then $\mathbb{E}[R_n] \le C\sqrt{nd\log(kn)}$
for an appropriately chosen universal constant $C > 0$.


The proof of this theorem follows relatively directly from the high-probability
correctness of the confidence intervals used to eliminate low-rewarding arms. We
leave the details to the reader in Exercise 22.1.
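
The elimination step of Algorithm 12 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `eliminate`, the array layout (one active arm per row) and the toy numbers are hypothetical, and the estimate $\hat\theta_\ell$ is taken as given rather than computed from a G-optimal design.

```python
import numpy as np

def eliminate(actions, theta_hat, eps):
    """Keep every arm a with max_b <theta_hat, b - a> <= 2 * eps."""
    values = actions @ theta_hat        # estimated mean reward of each active arm
    best = values.max()                 # max_b <theta_hat, b>
    keep = (best - values) <= 2.0 * eps # estimated gap to the empirically best arm
    return actions[keep]

# Toy example (hypothetical numbers): three arms in R^2.
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]])
theta_hat = np.array([1.0, 0.2])
survivors = eliminate(actions, theta_hat, eps=0.2)
print(survivors)
```

With these numbers the estimated rewards are 1.0, 0.2 and 0.72, so only the second arm has an estimated gap larger than $2\varepsilon_\ell = 0.4$ and is removed.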


Notes 


1 The assumption that the action set does not change is crucial for Algorithm 12.
Several complicated algorithms have been proposed and analysed for the case
where $\mathcal{A}_t$ is allowed to change from round to round under the assumption that
$|\mathcal{A}_t| \le k$ for all rounds. For these algorithms, it has been proven that

$$R_n = O\left(\sqrt{nd\log^3(nk)}\right). \tag{22.1}$$

When $k$ is small, these results improve on the bound for LinUCB in Chapter 19
by a factor of up to $\sqrt{d}$.


2 Algorithm 12 can be adapted to the case where $k$ is infinite by using confidence
intervals derived in Chapter 20. Once the dust has settled, you should find the
regret is

$$R_n = O\left(d\sqrt{n\log(n)}\right).$$


3 One advantage of Algorithm 12 is that it behaves well even when the linear
model is misspecified. Suppose the reward is $X_t = \langle\theta, A_t\rangle + \eta_t + f(A_t)$, where
$\eta_t$ is noise as usual and $f : \mathcal{A} \to \mathbb{R}$ is some function with $\|f\|_\infty \le \varepsilon$. Then the
regret of Algorithm 12 can be shown to be

$$R_n = O\left(\sqrt{dn\log(nk)} + n\varepsilon\sqrt{d}\log(n)\right).$$

The linear dependence on the horizon should be expected when $k$ is large. The
presence of $\sqrt{d}$ in the second term is unfortunate, but unavoidable in many
regimes as discussed by Lattimore and Szepesvári [2019b].


Bibliographic Remarks 


The algorithms achieving Eq. (22.1) for changing action sets are SupLinRel [Auer, 
2002] and SupLinUCB [Chu et al., 2011]. These algorithms assume that the 
action set sequence is non-random. They are also based on elimination, but use a 
sophisticated device to decouple the dependence of the design on the outcomes. 
Recently, Li et al. [2019c] refined these algorithms and proved that the minimax 
regret in this changing action-set context is at least $\Omega(\sqrt{dn\log(n/d)\log(k)})$,
which they also matched with an upper bound up to an iterated logarithm term 
(in n), and with the exception that log(n/d) is replaced by log(n). Unfortunately 
the analysis of these algorithms is long and technical, which prohibited us from 
presenting the ideas here. These algorithms are also not the most practical relative 
to LinUCB (Chapter 19) or Thompson sampling (Chapter 36). Of course this 
does not diminish the theoretical breakthrough. 

Phased elimination algorithms have appeared in many places, but the most 
similar to the algorithm presented here is the work on spectral bandits by Valko 
et al. [2014] (and we have also met them briefly in earlier chapters on finite-armed 
bandits). None of the works just mentioned used the Kiefer-Wolfowitz theorem. 
This idea is apparently new, but it is based on the literature on adversarial 
linear bandits where John’s ellipsoid has been used to define exploration policies 
[Bubeck et al., 2012]. For more details on adversarial linear bandits, read on to 
Part VI. 

Ghosh et al. [2017] address misspecified (stochastic) linear bandits with a fixed 
action set. In misspecified linear bandits, the reward is nearly a linear function of 
the feature vectors associated with the actions. Ghosh et al. [2017] demonstrate 
that in the favourable case when one can cheaply test linearity, an algorithm 
that first runs a test and then switches to either a linear bandit or a finite-armed 
bandit based on the outcome will achieve $(\sqrt{k} \wedge d)\sqrt{n}$ regret up to log factors.
We will return to misspecified linear bandits a few more times in the book. 
Exercises


22.1 In this exercise, you will prove Theorem 22.1. 


(a) Use Theorem 21.1 to show that the length of the $\ell$th phase is bounded by

$$T_\ell \le \frac{2d}{\varepsilon_\ell^2}\log\left(\frac{k\ell(\ell+1)}{\delta}\right) + \frac{d(d+1)}{2}.$$

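(b) Let $a^* \in \operatorname{argmax}_{a \in \mathcal{A}} \langle\theta_*, a\rangle$ be the optimal arm and use Eq. (21.1) to show
that

$$\mathbb{P}\left(\text{exists phase } \ell \text{ such that } a^* \notin \mathcal{A}_\ell\right) \le \frac{\delta}{k}.$$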


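(c) For action $a$ define $\ell_a = \min\{\ell : 2\varepsilon_\ell < \Delta_a\}$ to be the first phase where the
suboptimality gap of arm $a$ is larger than $2\varepsilon_\ell$. Show that

$$\mathbb{P}\left(a \in \mathcal{A}_{\ell_a}\right) \le \frac{\delta}{k}.$$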


(d) Show that with probability at least $1 - \delta$ the regret is bounded by

$$R_n \le C\sqrt{nd\log\left(\frac{k\log(n)}{\delta}\right)},$$

where $C > 0$ is a universal constant.
(e) Show that this implies Theorem 22.1 for the given choice of $\delta$.


22.2 (MISSPECIFIED LINEAR BANDITS) Assume the reward satisfies
$X_t = \langle\theta, A_t\rangle + \eta_t + f(A_t)$, where $\eta_t$ is 1-subgaussian noise as usual and $f : \mathcal{A} \to \mathbb{R}$ is
some function with $\|f\|_\infty \le \varepsilon$. Show that the expected regret of Algorithm 12
with the choice $\delta = 1/n$ is

$$R_n = O\left(\sqrt{dn\log(nk)} + n\varepsilon\sqrt{d}\log(n)\right).$$
Stochastic Linear Bandits with
Sparsity


In Chapter 19 we showed the linear variant of UCB has regret bounded by

$$R_n = O(d\sqrt{n}\log(n)),$$

which for fixed finite action sets can be improved to

$$R_n = O(\sqrt{dn\log(nk)}).$$


For moderately sized action sets, these approaches lead to a big improvement 
over what could be obtained by using the policies that do not make use of the 
linear structure. 

The situation is still not perfect, though. In typical applications, the features 
are chosen by the user of the system, and one can easily imagine there are many 
candidate features and limited information about which will be most useful. This 
presents the user with a challenging trade-off. If they include many features, then 
d will be large, and the algorithm may be slow to learn. But if a useful feature is 
omitted, then the linear model will almost certainly be quite wrong. Ideally, one 
should be able to add features without suffering much additional regret if the 
added feature does not contribute in a significant way. This can be captured by 
the notion of sparsity, which is the central theme of this chapter. 


Sparse Linear Stochastic Bandits 


Like in the standard stochastic linear bandit setting, at the beginning of round $t$,
the learner receives a decision set $\mathcal{A}_t \subset \mathbb{R}^d$. They then choose an action $A_t \in \mathcal{A}_t$
and receive a reward

$$X_t = \langle\theta_*, A_t\rangle + \eta_t, \tag{23.1}$$

where $(\eta_t)_t$ is zero-mean noise and $\theta_* \in \mathbb{R}^d$ is an unknown vector. The only
difference in the sparse setting is that the parameter vector $\theta_*$ is assumed to have
many zero entries. For $\theta \in \mathbb{R}^d$ let

$$\|\theta\|_0 = \sum_{i=1}^d \mathbb{I}\{\theta_i \neq 0\},$$

which is sometimes called the zero-‘norm’ (quotations because it is not really a
norm; see Exercise 23.1).
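
As a small illustration of the definition, the sketch below counts the non-zero entries of a vector and shows one way the zero-‘norm’ departs from a true norm: rescaling the vector leaves the count unchanged. The helper name `zero_norm` and the example vector are hypothetical choices made here for illustration.

```python
import numpy as np

def zero_norm(theta):
    """Number of non-zero coordinates of theta."""
    return int(np.count_nonzero(theta))

theta = np.array([0.0, 1.5, 0.0, -2.0])
print(zero_norm(theta))        # 2
print(zero_norm(3.0 * theta))  # 2 again, whereas a norm would scale by |3|
```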
ASSUMPTION 23.1. The following hold:


(a) (Sparse parameter) There exist known constants $m_0$ and $m_2$ such that
$\|\theta_*\|_0 \le m_0$ and $\|\theta_*\|_2 \le m_2$.

(b) (Bounded mean rewards) $\langle\theta_*, a\rangle \le 1$ for all $a \in \mathcal{A}_t$ and all rounds $t$.

(c) (Subgaussian noise) The noise is conditionally 1-subgaussian:

$$\text{for all } \lambda \in \mathbb{R}, \quad \mathbb{E}[\exp(\lambda\eta_t) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2/2) \quad \text{a.s.},$$

where $\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t, A_{t+1})$.


Much ink has been spilled on what can be said about the speed of learning in
linear models like (23.1) when $(A_t)_t$ are passively generated and the parameter
vector is known to be sparse. Most results are phrased about recovering $\theta_*$, but
there also exist a few results that quantify the error when predicting $X_t$. The
ideal outcome would be that the learning speed depends mostly on $m_0$, with only
a mild dependence on $d$. Almost all the results come under the assumption that
the covariance matrix of the actions $(A_t)_t$ is well conditioned.


The condition number of a positive definite matrix A is the ratio of its 
largest and smallest eigenvalues. A matrix is well conditioned if it has a 
small condition number. 
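
To make the definition concrete, here is a minimal NumPy sketch. The two matrices are assumptions chosen for illustration: one is well conditioned, the other mimics a covariance matrix dominated by a single direction, as would happen if a policy played nearly the same action in most rounds.

```python
import numpy as np


def condition_number(A):
    """Ratio of largest to smallest eigenvalue of a symmetric positive definite matrix."""
    eigvals = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
    return eigvals[-1] / eigvals[0]


# Well-conditioned example: eigenvalues 1.5 and 2.5.
well = np.array([[2.0, 0.5], [0.5, 2.0]])

# Hypothetical covariance dominated by one direction a_star:
# eigenvalues 0.01 and 100.01.
a_star = np.array([1.0, 1.0]) / np.sqrt(2.0)
skewed = 0.01 * np.eye(2) + 100.0 * np.outer(a_star, a_star)

kappa_well = condition_number(well)
kappa_skewed = condition_number(skewed)
print(kappa_well, kappa_skewed)
```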


The details are a bit more complicated than just the conditioning, but the 
main point is that the usual assumptions imposed on the covariance matrix of 
the actions for passive learning are never satisfied when the actions are chosen 
by a good bandit policy. The reason is simple. Bandit algorithms want to choose 
the optimal action as often as possible, which means the covariance matrix will 
have an eigenvector that points (approximately) towards the optimal action with 
a large corresponding eigenvalue. We need some approach that does not rely on 
such strong assumptions. 


Elimination on the Hypercube 


As a warm-up, consider the case where the action set is the d-dimensional 
hypercube: $\mathcal{A}_t = \mathcal{A} = [-1,1]^d$. To reduce clutter, we denote the true parameter 
vector by $\theta = \theta_*$. The hypercube is notable as an action set because it enjoys 
perfect separability. For each dimension $i \in [d]$, the value of $A_{ti} \in [-1,1]$ can 
be chosen independently of $A_{tj}$ for $j \neq i$. Because of this, the optimal action is 
$a^* = \operatorname{sign}(\theta)$, where 

$$\operatorname{sign}(\theta)_i = \operatorname{sign}(\theta_i) = \begin{cases} 1, & \text{if } \theta_i > 0; \\ 0, & \text{if } \theta_i = 0; \\ -1, & \text{if } \theta_i < 0. \end{cases}$$
So learning the optimal action amounts to learning the sign of $\theta_i$ for each 
dimension. A disadvantage of this structure is that in the worst case the sign 
of each $\theta_i$ must be learned independently, which in Chapter 24 we show leads 
to a worst-case regret of $R_n = \Omega(d\sqrt{n})$. On the positive side, the separability 
means that $\theta_i$ can be estimated in each dimension independently while paying 
absolutely no price for this experimentation when $\theta_i = 0$. It turns out that this 
allows us to design a policy for which $R_n = O(\|\theta\|_0 \sqrt{n})$, even without knowing 
the value of $\|\theta\|_0$. 

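Let $\mathcal{G}_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$ be the σ-algebra containing information up to 
time $t - 1$ (this differs from $\mathcal{F}_t$, which also includes information about the action 
chosen). Now suppose that $(A_{ti})_{i=1}^d$ are chosen to be conditionally independent 
given $\mathcal{G}_{t-1}$, and further assume for some specific $i \in [d]$ that $A_{ti}$ is sampled from 
a Rademacher distribution so that $\mathbb{P}(A_{ti} = 1 \mid \mathcal{G}_{t-1}) = \mathbb{P}(A_{ti} = -1 \mid \mathcal{G}_{t-1}) = 1/2$. 
Then 

$$\begin{aligned}
\mathbb{E}[A_{ti} X_t \mid \mathcal{G}_{t-1}] &= \mathbb{E}\Big[A_{ti}\Big(\sum_{j=1}^d A_{tj}\theta_j + \eta_t\Big) \,\Big|\, \mathcal{G}_{t-1}\Big] \\
&= \theta_i\, \mathbb{E}[A_{ti}^2 \mid \mathcal{G}_{t-1}] + \sum_{j \neq i} \theta_j\, \mathbb{E}[A_{tj} A_{ti} \mid \mathcal{G}_{t-1}] + \mathbb{E}[A_{ti}\eta_t \mid \mathcal{G}_{t-1}] \\
&= \theta_i,
\end{aligned}$$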


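where the first equality is the definition of $X_t = \langle \theta, A_t \rangle + \eta_t$, the second by linearity 
of expectation and the third by the conditional independence of $(A_{ti})_i$ and the 
fact that $\mathbb{E}[A_{ti} \mid \mathcal{G}_{t-1}] = 0$ and $\mathbb{E}[A_{ti}^2 \mid \mathcal{G}_{t-1}] = 1$. This looks quite promising, but 
we should also check the variance. Using our assumptions that $(\eta_t)$ is conditionally 
1-subgaussian and that $\langle \theta, a \rangle \le 1$ for all actions a, we have 

$$\mathbb{V}[A_{ti} X_t \mid \mathcal{G}_{t-1}] = \mathbb{E}[A_{ti}^2 X_t^2 \mid \mathcal{G}_{t-1}] - \theta_i^2 = \mathbb{E}[(\langle \theta, A_t \rangle + \eta_t)^2 \mid \mathcal{G}_{t-1}] - \theta_i^2 \le 2. \tag{23.2}$$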


And now we have cause for celebration. The value of $\theta_i$ can be estimated by 
choosing $A_{ti}$ to be a Rademacher random variable independent of the choices in 
other dimensions. All the policy does is treat all dimensions independently. For a 
particular dimension (say i), it explores by choosing $A_{ti} \in \{-1,1\}$ uniformly at 
random until its estimate is sufficiently accurate to commit to either $A_{ti} = 1$ or 
$A_{ti} = -1$ for all future rounds. How long this takes depends on $|\theta_i|$, but note that 
if $|\theta_i|$ is small, then the price of exploring is also limited. The policy that results 
from this idea is called selective explore-then-commit (Algorithm 13, SETC). 
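
The unbiasedness and variance claims can be checked by simulation. The sketch below is illustrative only: the parameter vector, the sample size and the use of standard Gaussian noise (one example of 1-subgaussian noise) are assumptions made here. It draws every coordinate of the action as an independent Rademacher variable and averages the products $A_{ti} X_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000

# Hypothetical sparse parameter with ||theta||_1 <= 1, so mean rewards are at most 1.
theta = np.array([0.3, 0.0, -0.2, 0.0, 0.1])

A = rng.choice([-1.0, 1.0], size=(n, d))   # independent Rademacher coordinates
X = A @ theta + rng.standard_normal(n)     # rewards with N(0, 1) noise

products = A * X[:, None]                  # row t holds (A_ti * X_t) for each i
theta_hat = products.mean(axis=0)          # Monte Carlo estimate of E[A_ti X_t]
empirical_var = products.var(axis=0)       # should stay below the bound 2 of Eq. (23.2)

print(theta_hat)
print(empirical_var)
```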


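THEOREM 23.2. There exist universal constants $C, C' > 0$ such that the regret 
of SETC satisfies 

$$R_n \le 3\|\theta\|_1 + C \sum_{i : \theta_i \neq 0} \frac{\log(n)}{|\theta_i|} \qquad \text{and} \qquad R_n \le 3\|\theta\|_1 + C' \|\theta\|_0 \sqrt{n \log(n)}.$$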


By appealing to the central limit theorem and the variance calculation in 
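1: Input n and d 
2: Set $E_{1i} = 1$ and $C_{1i} = \mathbb{R}$ for all $i \in [d]$ 
3: for t = 1, ..., n do 
4:   For each $i \in [d]$ sample $B_{ti} \sim$ RADEMACHER 
5:   Choose action: 

     $$(\forall i) \quad A_{ti} = \begin{cases} B_{ti} & \text{if } 0 \in C_{ti} \\ 1 & \text{if } C_{ti} \subset (0, \infty] \\ -1 & \text{if } C_{ti} \subset [-\infty, 0) \end{cases}$$

6:   Play $A_t$ and observe $X_t$ 
7:   Construct empirical estimators: 

     $$(\forall i) \quad T_i(t) = \sum_{s=1}^t E_{si}, \qquad \hat\theta_{ti} = \frac{\sum_{s=1}^t E_{si} A_{si} X_s}{T_i(t)}$$

8:   Construct confidence intervals: 

     $$(\forall i) \quad W_{ti} = 2\sqrt{\left(\frac{1}{T_i(t)} + \frac{1}{T_i(t)^2}\right) \log\left(n\sqrt{2T_i(t) + 1}\right)}$$

     $$(\forall i) \quad C_{t+1,i} = \left[\hat\theta_{ti} - W_{ti},\ \hat\theta_{ti} + W_{ti}\right]$$

9:   Update exploration parameters: 

     $$(\forall i) \quad E_{t+1,i} = \begin{cases} 0 & \text{if } 0 \notin C_{t+1,i} \text{ or } E_{ti} = 0 \\ 1 & \text{otherwise.} \end{cases}$$

10: end for 


Algorithm 13: Selective explore-then-commit. 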


Eq. (23.2), we should be hopeful that the confidence intervals used by the 
algorithm are sufficiently large to contain the true $\theta_i$ with high probability, but 
this still needs to be proven. 
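
For readers who want to experiment, the following is a minimal simulation sketch of Algorithm 13, not a reference implementation. It assumes standard Gaussian noise, uses the confidence width $W_{ti}$ as stated in the algorithm, and tracks the realised pseudo-regret $\sum_t (\|\theta\|_1 - \langle \theta, A_t \rangle)$. The function name `setc`, the test parameter and the horizon are hypothetical choices.

```python
import numpy as np


def setc(theta, n, rng):
    """Sketch of selective explore-then-commit on the hypercube [-1, 1]^d."""
    d = len(theta)
    exploring = np.ones(d, dtype=bool)  # E_ti: coordinate i is still exploring
    commit = np.zeros(d)                # sign played once exploration has stopped
    T = np.zeros(d)                     # T_i(t): number of exploration rounds
    S = np.zeros(d)                     # running sum of E_si * A_si * X_s
    regret = 0.0
    for _ in range(n):
        B = rng.choice([-1.0, 1.0], size=d)
        A = np.where(exploring, B, commit)
        X = A @ theta + rng.standard_normal()
        regret += np.abs(theta).sum() - A @ theta
        T += exploring
        S += exploring * A * X
        theta_hat = S / T               # T >= 1 because every coordinate explores in round 1
        W = 2.0 * np.sqrt((1.0 / T + 1.0 / T**2) * np.log(n * np.sqrt(2.0 * T + 1.0)))
        # Stop exploring coordinate i once 0 leaves [theta_hat_i - W_i, theta_hat_i + W_i].
        newly = exploring & (np.abs(theta_hat) > W)
        commit[newly] = np.sign(theta_hat[newly])
        exploring &= ~newly
    return regret, exploring, commit


rng = np.random.default_rng(0)
theta = np.array([0.5, 0.0, 0.0, -0.4, 0.0, 0.0])  # hypothetical 2-sparse parameter
regret, exploring, commit = setc(theta, n=5000, rng=rng)
print(regret, exploring, commit)
```

In runs like this one, the coordinates with $\theta_i = 0$ typically keep exploring for the whole horizon at no cost, while the non-zero coordinates commit to the correct sign after a number of rounds that grows as $|\theta_i|$ shrinks.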


LEMMA 23.3. Define $\tau_i = n \wedge \max\{t : E_{ti} = 1\}$, and let $F_i = \mathbb{I}\{\theta_i \notin C_{\tau_i+1,i}\}$ be 
the event that $\theta_i$ is not in the confidence interval constructed at time $\tau_i$. Then 
$\mathbb{P}(F_i) \le 1/n$. 


The proof of Lemma 23.3 is left until after the proof of Theorem 23.2. 


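Proof of Theorem 23.2 Recalling the definition of the regret and using the 
fact that the optimal action is $a^* = \operatorname{sign}(\theta)$, we have the following regret 
decomposition: 

$$R_n = n \max_{a \in \mathcal{A}} \langle \theta, a \rangle - \mathbb{E}\left[\sum_{t=1}^n \langle \theta, A_t \rangle\right] = \sum_{i=1}^d \underbrace{\left( n|\theta_i| - \mathbb{E}\left[\sum_{t=1}^n A_{ti}\theta_i\right] \right)}_{R_{ni}}. \tag{23.3}$$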


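Clearly, if $\theta_i = 0$, then $R_{ni} = 0$. And so it suffices to bound $R_{ni}$ for each i 
with $|\theta_i| > 0$. Suppose that $|\theta_i| > 0$ for some i and the failure event $F_i$ given 
in Lemma 23.3 does not occur. Then $\theta_i \in C_{\tau_i+1,i}$, and by the definition of the 
algorithm, $A_{ti} = \operatorname{sign}(\theta_i)$ for all $t > \tau_i$. Therefore, 

$$\begin{aligned}
R_{ni} &= n|\theta_i| - \mathbb{E}\left[\sum_{t=1}^n A_{ti}\theta_i\right] = \mathbb{E}\left[\sum_{t=1}^n |\theta_i|\left(1 - A_{ti}\operatorname{sign}(\theta_i)\right)\right] \\
&\le 2n|\theta_i|\,\mathbb{P}(F_i) + |\theta_i|\,\mathbb{E}\left[\mathbb{I}\{F_i^c\}\,\tau_i\right].
\end{aligned} \tag{23.4}$$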


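Since $\tau_i$ is the first round t when $0 \notin C_{t+1,i}$ it follows that if $F_i$ does not occur, 
then $\theta_i \in C_{\tau_i,i}$ and $0 \in C_{\tau_i,i}$. Thus the width of the confidence interval $C_{\tau_i,i}$ must 
be at least $|\theta_i|$, and so 

$$2W_{\tau_i-1,i} = 4\sqrt{\left(\frac{1}{\tau_i - 1} + \frac{1}{(\tau_i - 1)^2}\right) \log\left(n\sqrt{2\tau_i - 1}\right)} \ge |\theta_i|,$$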


which after rearranging shows for some universal constant $C > 0$ that 

$$\mathbb{I}\{F_i^c\}\,\tau_i \le 1 + \frac{C \log(n)}{\theta_i^2}.$$

Combining this result with Eq. (23.4) leads to 

$$R_{ni} \le 2n|\theta_i|\,\mathbb{P}(F_i) + |\theta_i| + \frac{C \log(n)}{|\theta_i|}.$$

Using Lemma 23.3 to bound $\mathbb{P}(F_i)$ and substituting into the decomposition 
Eq. (23.3) completes the proof of the first part. The second part is left as a treat 
for you (Exercise 23.2). 


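Proof of Lemma 23.3 Let $S_{ti} = \sum_{j \neq i} A_{tj}\theta_j$ and $Z_{ti} = A_{ti}\eta_t + A_{ti}S_{ti}$. For $t \le \tau_i$, 

$$\hat\theta_{ti} - \theta_i = \frac{1}{t} \sum_{s=1}^t Z_{si}.$$

The next step is to show that $Z_{ti}$ is conditionally $\sqrt{2}$-subgaussian for $t \le \tau_i$: 

$$\begin{aligned}
\mathbb{E}[\exp(\lambda Z_{ti}) \mid \mathcal{G}_{t-1}] &= \mathbb{E}\big[\mathbb{E}[\exp(\lambda Z_{ti}) \mid \mathcal{F}_{t-1}] \mid \mathcal{G}_{t-1}\big] \\
&= \mathbb{E}\big[\exp(\lambda A_{ti} S_{ti})\, \mathbb{E}[\exp(\lambda A_{ti}\eta_t) \mid \mathcal{F}_{t-1}] \mid \mathcal{G}_{t-1}\big] \\
&\le \mathbb{E}\left[\exp(\lambda A_{ti} S_{ti}) \exp\left(\frac{\lambda^2}{2}\right) \,\middle|\, \mathcal{G}_{t-1}\right] \\
&= \exp\left(\frac{\lambda^2}{2}\right) \mathbb{E}\big[\mathbb{E}[\exp(\lambda A_{ti} S_{ti}) \mid \mathcal{G}_{t-1}, S_{ti}] \mid \mathcal{G}_{t-1}\big] \\
&\le \exp\left(\frac{\lambda^2}{2}\right) \mathbb{E}\left[\exp\left(\frac{\lambda^2 S_{ti}^2}{2}\right) \,\middle|\, \mathcal{G}_{t-1}\right] \\
&\le \exp(\lambda^2).
\end{aligned}$$

The first inequality used the fact that $\eta_t$ is conditionally 1-subgaussian. The second-to-last 
inequality follows because $A_{ti}$ is conditionally Rademacher for $t \le \tau_i$, 
which is 1-subgaussian by Hoeffding’s lemma (5.11). The final inequality follows 
because $S_{ti} \le \|A_t\|_\infty \|\theta\|_1 \le 1$. The result follows by applying the concentration 
bound from Exercise 20.8. 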


Online to Confidence Set Conversion 


A new plan is needed to relax the assumption that the action set is a hypercube. 
The idea is to modify the ellipsoidal confidence set used in Chapter 19 to have a 
smaller radius. We will see that modifying the algorithm in Chapter 19 to use 
the smaller confidence intervals improves the regret to $R_n = O(\sqrt{dpn}\,\log(n))$. 


Without assumptions on the action set, one cannot hope to have a regret 
smaller than $O(\sqrt{dn})$. To see this, recall that d-armed bandits can be 
represented as linear bandits with $\mathcal{A}_t = \{e_1, \ldots, e_d\}$. For these problems, 
Theorem 15.2 shows that for any policy there exists a d-armed bandit for 
which $R_n = \Omega(\sqrt{dn})$. Checking the proof reveals that when adapted to the 
linear setting the parameter vector is 2-sparse. 


The construction that follows makes use of a kind of duality between online 
prediction and confidence sets. While we will only apply the idea to the sparse 
linear case, the approach is generic. 

The prediction problem considered is online linear prediction under the 
squared loss. This is also known as online linear regression. The learner 
interacts with an environment in a sequential manner where in each round 
$t \in \mathbb{N}^+$: 


1 The environment chooses $X_t \in \mathbb{R}$ and $A_t \in \mathbb{R}^d$ in an arbitrary fashion. 
2 The value of $A_t$ is revealed to the learner (but not $X_t$). 
3 The learner produces a real-valued prediction $\hat X_t \in \mathbb{R}$ in some way. 
4 The environment reveals $X_t$ to the learner and the loss is $(X_t - \hat X_t)^2$. 


The regret of the learner relative to a linear predictor that uses the weights 
$\theta \in \mathbb{R}^d$ is 

$$\rho_n(\theta) = \sum_{t=1}^n (X_t - \hat X_t)^2 - \sum_{t=1}^n (X_t - \langle \theta, A_t \rangle)^2. \tag{23.5}$$

We say that the learner enjoys a regret guarantee $B_n$ relative to $\Theta \subseteq \mathbb{R}^d$ if for 
any strategy of the environment, 

$$\sup_{\theta \in \Theta} \rho_n(\theta) \le B_n. \tag{23.6}$$

The online learning literature has a number of powerful techniques for this 
learning problem. Later we will give a specific result for the sparse case when 
$\Theta = \{x : \|x\|_0 \le m_0\}$, but first we show how to use such a learning algorithm 
to construct a confidence set. Take any learner for online linear regression, and 
assume the environment generates $X_t$ in a stochastic manner like in linear bandits: 

$$X_t = \langle \theta_*, A_t \rangle + \eta_t. \tag{23.7}$$


Combining Eqs. (23.5) to (23.7) with elementary algebra, 

$$\begin{aligned}
Q_n = \sum_{t=1}^n (\hat X_t - \langle \theta_*, A_t \rangle)^2 &= \rho_n(\theta_*) + 2\sum_{t=1}^n \eta_t (\hat X_t - \langle \theta_*, A_t \rangle) \\
&\le B_n + 2\sum_{t=1}^n \eta_t (\hat X_t - \langle \theta_*, A_t \rangle),
\end{aligned} \tag{23.8}$$

where the first equality serves as the definition of $Q_n$. Let us now take stock for a 
moment. If we could somehow remove the dependence on the noise $\eta_t$ in the right-hand 
side, then we could define a confidence set consisting of all $\theta$ that satisfy 
the equation. Of course the noise has zero mean and is conditionally independent 
of its multiplier, so the expectation of this term is zero. The fluctuations can be 
controlled with high probability using a little concentration analysis. Let 


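$$Z_t = \sum_{s=1}^t \eta_s (\hat X_s - \langle \theta_*, A_s \rangle).$$

Since $\hat X_t$ is chosen based on information available at the beginning of the round, 
$\hat X_t$ is $\mathcal{F}_{t-1}$-measurable, and so 

$$\text{for all } \lambda \in \mathbb{R}, \quad \mathbb{E}[\exp(\lambda(Z_t - Z_{t-1})) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2 \sigma_t^2 / 2),$$

where $\sigma_t^2 = (\hat X_t - \langle \theta_*, A_t \rangle)^2$. The uniform self-normalised tail bound 
(Theorem 20.4) with $\lambda = 1$ implies that 

$$\mathbb{P}\left(\text{exists } t \ge 0 \text{ such that } |Z_t| \ge \sqrt{(1 + Q_t) \log\left(\frac{1 + Q_t}{\delta^2}\right)}\right) \le \delta.$$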
Provided this low-probability event does not occur, then from Eq. (23.8) we have 

$$Q_t \le B_t + 2\sqrt{(1 + Q_t) \log\left(\frac{1 + Q_t}{\delta^2}\right)}. \tag{23.9}$$

While both sides depend on $Q_t$, the left-hand side grows linearly, while the 
right-hand side grows sublinearly in $Q_t$. This means that the largest value of $Q_t$ 
that satisfies the above inequality is finite. A tedious calculation then shows this 
value must be less than 

$$\beta_t(\delta) = 1 + 2B_t + 32 \log\left(\frac{\sqrt{8} + \sqrt{1 + B_t}}{\delta}\right). \tag{23.10}$$
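
The relationship between Eq. (23.9) and Eq. (23.10) can be explored numerically. The sketch below is a sanity check under assumptions, not a proof: it finds the largest $Q$ satisfying Eq. (23.9) by bisection and compares it with $\beta_t(\delta)$. The helper names `beta` and `largest_q` and the tested values of $B$ and $\delta$ are chosen here for illustration. Bisection is justified because the right-hand side minus the left-hand side of Eq. (23.9) is concave in $Q$ and non-negative at $Q = 0$.

```python
import math


def beta(B, delta):
    """Radius from Eq. (23.10)."""
    return 1.0 + 2.0 * B + 32.0 * math.log((math.sqrt(8.0) + math.sqrt(1.0 + B)) / delta)


def largest_q(B, delta, hi=1e12, iters=200):
    """Largest Q with Q <= B + 2*sqrt((1+Q)*log((1+Q)/delta^2)), by bisection."""

    def slack(q):
        # Non-negative exactly when inequality (23.9) holds at q.
        return B + 2.0 * math.sqrt((1.0 + q) * math.log((1.0 + q) / delta**2)) - q

    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slack(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical (B, delta) pairs used only for this check.
cases = [(0.0, 0.5), (10.0, 0.1), (1000.0, 0.01)]
results = [(largest_q(B, delta), beta(B, delta)) for B, delta in cases]
for (B, delta), (q_star, b) in zip(cases, results):
    print(B, delta, q_star, b)
```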


By piecing together the parts, we conclude that with probability at least $1 - \delta$ 
the following holds for all t: 

$$Q_t = \sum_{s=1}^t (\hat X_s - \langle \theta_*, A_s \rangle)^2 \le \beta_t(\delta).$$

We could define $C_{t+1}$ to be the set of all $\theta$ such that the above holds with $\theta_*$ 
replaced by $\theta$, but there is one additional subtlety, which is that the resulting 
confidence set may be unbounded (think about the case that $\sum_{s=1}^t A_s A_s^\top$ is 
not invertible). In Chapter 19 we overcame this problem by regularising the least 
squares estimator. Since we have assumed that $\|\theta_*\|_2 \le m_2$, the previous display 
implies that 

$$\|\theta_*\|_2^2 + \sum_{s=1}^t (\hat X_s - \langle \theta_*, A_s \rangle)^2 \le m_2^2 + \beta_t(\delta).$$


All together, we have the following theorem: 


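THEOREM 23.4. Let $\delta \in (0,1)$ and assume that $\theta_* \in \Theta$ and $\sup_{\theta \in \Theta} \rho_t(\theta) \le B_t$. If 

$$C_{t+1} = \left\{\theta \in \mathbb{R}^d : \|\theta\|_2^2 + \sum_{s=1}^t (\hat X_s - \langle \theta, A_s \rangle)^2 \le m_2^2 + \beta_t(\delta)\right\},$$

then $\mathbb{P}(\text{exists } t \in \mathbb{N} \text{ such that } \theta_* \notin C_{t+1}) \le \delta$. 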


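The confidence set in Theorem 23.4 is not in the most convenient form. By 
defining $V_t = I + \sum_{s=1}^t A_s A_s^\top$ and $S_t = \sum_{s=1}^t A_s \hat X_s$ and $\hat\theta_t = V_t^{-1} S_t$ and 
performing an algebraic calculation that we leave to the reader (see Exercise 23.5), 
one can see that 

$$\|\theta\|_2^2 + \sum_{s=1}^t (\hat X_s - \langle \theta, A_s \rangle)^2 = \|\theta - \hat\theta_t\|_{V_t}^2 + \sum_{s=1}^t (\hat X_s - \langle \hat\theta_t, A_s \rangle)^2 + \|\hat\theta_t\|_2^2. \tag{23.11}$$

Using this, the confidence set can be rewritten in the familiar form of an ellipsoid: 

$$C_{t+1} = \left\{\theta \in \mathbb{R}^d : \|\theta - \hat\theta_t\|_{V_t}^2 \le m_2^2 + \beta_t(\delta) - \|\hat\theta_t\|_2^2 - \sum_{s=1}^t (\hat X_s - \langle \hat\theta_t, A_s \rangle)^2\right\}.$$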
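
Before attempting Exercise 23.5, it can be reassuring to check Eq. (23.11) numerically. The sketch below does so on random data. The dimensions, the random actions and predictions, and the test point `theta` are arbitrary assumptions; the check does not replace the algebraic proof.

```python
import numpy as np

rng = np.random.default_rng(1)
d, t = 4, 30

A = rng.standard_normal((t, d))   # rows play the role of A_1, ..., A_t
X_hat = rng.standard_normal(t)    # predictions from some online predictor
theta = rng.standard_normal(d)    # arbitrary point at which to test the identity

V = np.eye(d) + A.T @ A           # V_t = I + sum_s A_s A_s^T
S = A.T @ X_hat                   # S_t = sum_s A_s * X_hat_s
theta_hat = np.linalg.solve(V, S)

# Left-hand side of Eq. (23.11).
lhs = theta @ theta + np.sum((X_hat - A @ theta) ** 2)

# Right-hand side of Eq. (23.11).
diff = theta - theta_hat
rhs = diff @ V @ diff + np.sum((X_hat - A @ theta_hat) ** 2) + theta_hat @ theta_hat

print(lhs, rhs)
```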
1: Input Online linear predictor and regret bound $B_t$, confidence parameter 
   $\delta \in (0,1)$ 
2: for t = 1, ..., n do 
3:   Receive action set $\mathcal{A}_t$ 
4:   Compute confidence set: 

     $$C_t = \left\{\theta \in \mathbb{R}^d : \|\theta\|_2^2 + \sum_{s=1}^{t-1} (\hat X_s - \langle \theta, A_s \rangle)^2 \le m_2^2 + \beta_{t-1}(\delta)\right\}$$

5:   Calculate optimistic action 

     $$A_t = \operatorname{argmax}_{a \in \mathcal{A}_t} \max_{\theta \in C_t} \langle \theta, a \rangle$$

6:   Feed $A_t$ to the online linear predictor and obtain prediction $\hat X_t$ 
7:   Play $A_t$ and receive reward $X_t$ 
8:   Feed $X_t$ to online linear predictor as feedback 
9: end for 


Algorithm 14: Online linear predictor UCB (OLR-UCB). 


It is not obvious that $C_{t+1}$ is not empty because the radius could be negative. 
Theorem 23.4 shows, however, that with high probability $\theta_* \in C_{t+1}$. At last we 
have established all the conditions required for Theorem 19.2, which implies the 
following theorem bounding the regret of Algorithm 14: 


THEOREM 23.5. With probability at least $1 - \delta$ the pseudo-regret of OLR-UCB 
satisfies 

$$\hat R_n \le \sqrt{8dn\left(m_2^2 + \beta_{n-1}(\delta)\right) \log\left(1 + \frac{n}{d}\right)}.$$


Sparse Online Linear Prediction 


THEOREM 23.6. There exists a strategy $\pi$ for the learner such that for any 
$\theta \in \mathbb{R}^d$, the regret $\rho_n(\theta)$ of $\pi$ against any strategic environment such that 
$\max_{t \in [n]} \|A_t\|_2 \le L$ and $\max_{t \in [n]} |X_t| \le X$ satisfies 

$$\rho_n(\theta) \le cX^2 \|\theta\|_0 \left\{\log(e + n^{1/2}L) + C_n \log\left(1 + \frac{\|\theta\|_1}{\|\theta\|_0}\right)\right\} + (1 + X^2) C_n,$$

where $c > 0$ is some universal constant and $C_n = 2 + \log_2 \log(e + n^{1/2}L)$. 


Note that $C_n = O(\log\log(n))$, so by dropping the dependence on X and L, we 
have 

$$\sup_{\theta : \|\theta\|_0 \le m_0,\ \|\theta\|_2 \le L} \rho_n(\theta) = O(m_0 \log(n)).$$


As a final catch, the rewards $(X_t)$ in sparse linear bandits with subgaussian noise 
are not necessarily bounded. However, the subgaussian property implies that with 
probability $1 - \delta$, $|\eta_t| \le \log(2/\delta)$. By choosing $\delta = 1/n^2$ and Assumption 23.1, 
we have 

$$\mathbb{P}\left(\max_{t \in [n]} |X_t| \ge 1 + \log(2n^2)\right) \le \frac{1}{n}.$$

Putting all the pieces together shows that the expected regret of OLR-UCB when 
using the predictor provided by Theorem 23.6 and when $\|\theta\|_0 \le m_0$ satisfies 

$$R_n = O\left(\sqrt{dnm_0}\,\log(n)^2\right).$$


Notes 


The strategy achieving the bound in Theorem 23.6 is not computationally 
efficient. In fact we do not know of any polynomial time algorithm with 
logarithmic regret for this problem. The consequence is that Algorithm 14 does 
not yet have an efficient implementation. 


While we focused on the sparse case, the results and techniques apply to other 
settings. For example, we can also get alternative confidence sets from results 
in online learning even for the standard non-sparse case. Or one may consider 
additional or different structural assumptions on 0. 


When the online linear regression results are applied, it is important to use the 
tightest possible, data-dependent regret bounds $B_n$. In online learning most 
regret bounds start as tight, data-dependent bounds, which are then loosened 
to get further insight into the structure of problems. For our application, 
naturally one should use the tightest available regret bounds (or modify the 
existing proofs to get tighter data-dependent bounds). The gains from using 
data-dependent bounds can be significant. 


The confidence set used by Algorithm 14 depends on the sparsity parameter 
$m_0$, which must be known in advance. No algorithm can enjoy a regret of 
$O(\sqrt{\|\theta_*\|_0\, dn})$ for all $\|\theta_*\|_0$ simultaneously (see Chapter 24). 


The bound in Theorem 23.5 still depends on the ambient dimension. In general 
this is unavoidable, as we show in Theorem 24.3. For this reason it recently 
became popular to study the contextual setting with changing actions and 
make assumptions on the distribution of the contexts so that techniques from 
high-dimensional statistics can be brought to bear. These approaches are still 
in their infancy and deciding on the right assumptions is a challenge. The 
reader is referred to the recent papers by Kim and Paik [2019] and Bastani 
and Bayati [2020]. 
Interestingly, what they show is that the relationship goes in both directions: 
tail inequalities imply regret bounds, and regret bounds imply tail inequalities. 
those in Section 23.3 have been used earlier in a series of papers by Claudio 
Gentile and friends [Dekel et al., 2010, 2012, Crammer and Gentile, 2013, Gentile 
and Orabona, 2012, 2014]. Carpentier and Munos [2012] consider a special case 
where the action set is the unit sphere and the noise is vector valued so that the 
reward is $X_t = \langle A_t, \theta + \eta_t \rangle$. They prove bounds that essentially depend on the 
sparsity of $\theta$ and $\mathbb{E}[\|\eta_t\|_2^2]$. Our setting is recovered by choosing $\eta_t$ to be a vector 
of independent standard Gaussian random variables, but in this case the bounds 
recovered by the proposed algorithm are suboptimal. 


Exercises 


23.1 (THE ZERO-‘NORM’) A norm on $\mathbb{R}^d$ is a function $\|\cdot\| : \mathbb{R}^d \to \mathbb{R}$ such that 
for all $a \in \mathbb{R}$ and $x, y \in \mathbb{R}^d$, it holds that: (a) $\|x\| = 0$ if and only if $x = 0$ and 
(b) $\|ax\| = |a|\|x\|$ and (c) $\|x + y\| \le \|x\| + \|y\|$ and (d) $\|x\| \ge 0$. Show that $\|\cdot\|_0$ 
given by $\|x\|_0 = \sum_{i=1}^d \mathbb{I}\{x_i \neq 0\}$ is not a norm. 


23.2 (MINIMAX BOUND FOR SETC) Prove the second part of Theorem 23.2. 


23.3 (ANYTIME ALGORITHM) Algorithm 13 is not anytime (it requires advance 
knowledge of the horizon). Design a modified version that does not require 
this knowledge and prove a comparable regret bound to what was given in 
Theorem 23.2. 


HINT One way is to use the doubling trick, but a more careful approach will 
lead to a more practical algorithm. 


23.4 Complete the calculation to derive Eq. (23.10) from Eq. (23.9). 


23.5 Prove the equality in Eq. (23.11). 
Minimax Lower Bounds for 
Stochastic Linear Bandits 


Lower bounds for linear bandits turn out to be more nuanced than those for the 
classical finite-armed bandit. The difference is that for linear bandits the shape 
of the action set plays a role in the form of the regret, not just the distribution 
of the noise. This should not come as a big surprise because the stochastic 
finite-armed bandit problem can be modeled as a linear bandit with actions 
being the standard basis vectors, $\mathcal{A} = \{e_1, \ldots, e_k\}$. In this case the actions are 
orthogonal, which means that samples from one action do not give information 
about the rewards for other actions. Other action sets such as the unit ball 
($\mathcal{A} = B_2^d = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$) do not share this property. For example, if 
$d = 2$ and $\mathcal{A} = B_2^d$ and an algorithm chooses actions $e_1 = (1,0)$ and $e_2 = (0,1)$ 
many times, then it can deduce the reward it would obtain from choosing any 
other action. 

All results of this chapter have a worst-case flavour showing what is (not) 
achievable in general, or under a sparsity constraint, or if the realisable assumption 
is not satisfied. The analysis uses the information-theoretic tools introduced in 
Part IV combined with careful choices of action sets. The hard part is guessing 
what is the worst case, which is followed by simply turning the crank on the 
usual machinery. 

In all lower bounds, we use a simple model with Gaussian noise. For action 
$A_t \in \mathcal{A} \subseteq \mathbb{R}^d$ the reward is $X_t = \mu(A_t) + \eta_t$ where $\eta_t \sim \mathcal{N}(0,1)$ is a sequence of 
independent standard Gaussian noise and $\mu : \mathcal{A} \to \mathbb{R}$ is the mean reward. We 
will usually assume there exists a $\theta \in \mathbb{R}^d$ such that $\mu(a) = \langle a, \theta \rangle$. We write $\mathbb{P}_\mu$ to 
indicate the measure on outcomes induced by the interaction of the fixed policy 
and the Gaussian bandit parameterised by $\mu$. Because we are now proving lower 
bounds, it becomes necessary to be explicit about the dependence of the regret 
on $\mathcal{A}$ and $\mu$ or $\theta$. The regret of a policy is: 

$$R_n(\mathcal{A}, \mu) = n \max_{a \in \mathcal{A}} \mu(a) - \mathbb{E}_\mu\left[\sum_{t=1}^n X_t\right],$$

where the expectation is taken with respect to $\mathbb{P}_\mu$. Except in Section 24.4, we 
assume the reward function is linear, which means there exists a $\theta \in \mathbb{R}^d$ such 
that $\mu(a) = \langle a, \theta \rangle$. In these cases, we write $R_n(\mathcal{A}, \theta)$ and $\mathbb{E}_\theta$ and $\mathbb{P}_\theta$. Recall the 
notation used for finite-armed bandits by defining $T_x(t) = \sum_{s=1}^t \mathbb{I}\{A_s = x\}$. 
Hypercube 


The first lower bound is for the hypercube action set and shows that the upper 
bounds in Chapter 19 cannot be improved in general. 


THEOREM 24.1. Let $\mathcal{A} = [-1,1]^d$ and $\Theta = \{-n^{-1/2}, n^{-1/2}\}^d$. Then, for any 
policy, there exists a vector $\theta \in \Theta$ such that: 

$$R_n(\mathcal{A}, \theta) \ge \frac{\exp(-2)}{8}\, d\sqrt{n}.$$


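Proof By the relative entropy identities in Exercise 15.8.(b) and Exercise 14.7, 
we have for $\theta, \theta' \in \Theta$ that 

$$D(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \sum_{t=1}^n \mathbb{E}_\theta\left[\langle A_t, \theta - \theta' \rangle^2\right]. \tag{24.1}$$

For $i \in [d]$ and $\theta \in \Theta$, define 

$$p_{\theta i} = \mathbb{P}_\theta\left(\sum_{t=1}^n \mathbb{I}\{\operatorname{sign}(A_{ti}) \neq \operatorname{sign}(\theta_i)\} \ge n/2\right).$$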
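Now let $i \in [d]$ and $\theta \in \Theta$ be fixed, and let $\theta'_j = \theta_j$ for $j \neq i$ and $\theta'_i = -\theta_i$. Then, 
by the Bretagnolle-Huber inequality (Theorem 14.2) and Eq. (24.1), 

$$p_{\theta i} + p_{\theta' i} \ge \frac{1}{2} \exp\left(-\frac{1}{2} \sum_{t=1}^n \mathbb{E}_\theta\left[\langle A_t, \theta - \theta' \rangle^2\right]\right) \ge \frac{1}{2} \exp(-2). \tag{24.2}$$

Applying an ‘averaging hammer’ over all $\theta \in \Theta$, which satisfies $|\Theta| = 2^d$, we get 

$$\sum_{\theta \in \Theta} \frac{1}{|\Theta|} \sum_{i=1}^d p_{\theta i} = \frac{1}{|\Theta|} \sum_{i=1}^d \sum_{\theta \in \Theta} p_{\theta i} \ge \frac{d}{4} \exp(-2).$$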


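This implies that there exists a $\theta \in \Theta$ such that $\sum_{i=1}^d p_{\theta i} \ge d\exp(-2)/4$. By 
the definition of $p_{\theta i}$, the regret for this choice of $\theta$ is at least 

$$\begin{aligned}
R_n(\mathcal{A}, \theta) &= \mathbb{E}_\theta\left[\sum_{t=1}^n \sum_{i=1}^d (\operatorname{sign}(\theta_i) - A_{ti})\theta_i\right] \\
&\ge \frac{1}{\sqrt{n}} \sum_{i=1}^d \mathbb{E}_\theta\left[\sum_{t=1}^n \mathbb{I}\{\operatorname{sign}(A_{ti}) \neq \operatorname{sign}(\theta_i)\}\right] \\
&\ge \frac{\sqrt{n}}{2} \sum_{i=1}^d \mathbb{P}_\theta\left(\sum_{t=1}^n \mathbb{I}\{\operatorname{sign}(A_{ti}) \neq \operatorname{sign}(\theta_i)\} \ge n/2\right) \\
&= \frac{\sqrt{n}}{2} \sum_{i=1}^d p_{\theta i} \ge \frac{\exp(-2)}{8}\, d\sqrt{n},
\end{aligned}$$

where the first line follows since the optimal action satisfies $a^*_i = \operatorname{sign}(\theta_i)$ for 
$i \in [d]$, the first inequality follows from a simple case-based analysis showing that 
$(\operatorname{sign}(\theta_i) - A_{ti})\theta_i \ge |\theta_i|\,\mathbb{I}\{\operatorname{sign}(A_{ti}) \neq \operatorname{sign}(\theta_i)\}$, the second inequality is Markov’s 
inequality (see Lemma 5.1), and the last inequality follows from the choice of 
$\theta$. 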


Except for logarithmic factors, this shows that the algorithm of Chapter 19 
is near optimal for this action set. The same proof works when $\mathcal{A} = \{-1,1\}^d$ 
is restricted to the corners of the hypercube, which is a finite-armed linear 
bandit. In Chapter 22, we gave a policy with regret $R_n = O(\sqrt{nd\log(nk)})$, 
where $k = |\mathcal{A}|$. There is no contradiction because the action set in the above 
proof has $k = |\mathcal{A}| = 2^d$ elements. 


Unit Ball 


Lower-bounding the minimax regret when the action set is the unit ball presents 
an additional challenge relative to the hypercube. The product structure of the 
hypercube means that the actions of the learner in one dimension do not constrain 
their choices in other dimensions. For the unit ball, this is not true, and this 
complicates the analysis. Nevertheless, a small modification of the technique 
allows us to prove a similar bound. 


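THEOREM 24.2. Assume $d \le 2n$ and let $\mathcal{A} = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$. Then 
there exists a parameter vector $\theta \in \mathbb{R}^d$ with $\|\theta\|_2^2 = d^2/(48n)$ such that 

$$R_n(\mathcal{A}, \theta) \ge d\sqrt{n}/(16\sqrt{3}).$$

Proof Let $\Delta = \frac{1}{4\sqrt{3}}\sqrt{d/n}$ and $\theta \in \{\pm\Delta\}^d$ and for $i \in [d]$, define $\tau_i = n \wedge \min\{t : 
\sum_{s=1}^t A_{si}^2 \ge n/d\}$. Then, 

$$\begin{aligned}
R_n(\mathcal{A}, \theta) &= \Delta\, \mathbb{E}_\theta\left[\sum_{t=1}^n \sum_{i=1}^d \left(\frac{1}{\sqrt{d}} - A_{ti}\operatorname{sign}(\theta_i)\right)\right] \\
&\ge \frac{\Delta\sqrt{d}}{2}\, \mathbb{E}_\theta\left[\sum_{t=1}^n \sum_{i=1}^d \left(\frac{1}{\sqrt{d}} - A_{ti}\operatorname{sign}(\theta_i)\right)^2\right] \\
&\ge \frac{\Delta\sqrt{d}}{2} \sum_{i=1}^d \mathbb{E}_\theta\left[\sum_{t=1}^{\tau_i} \left(\frac{1}{\sqrt{d}} - A_{ti}\operatorname{sign}(\theta_i)\right)^2\right],
\end{aligned}$$

where the first inequality uses that $\|A_t\|_2^2 \le 1$. Fix $i \in [d]$. For $x \in \{\pm 1\}$, define 
$U_i(x) = \sum_{t=1}^{\tau_i} (1/\sqrt{d} - A_{ti}x)^2$ and let $\theta' \in \{\pm\Delta\}^d$ be another parameter vector 
such that $\theta_j = \theta'_j$ for $j \neq i$ and $\theta'_i = -\theta_i$. Assume without loss of generality that 
$\theta_i > 0$. Let $\mathbb{P}$ and $\mathbb{P}'$ be the laws of $U_i(1)$ with respect to the bandit/learner 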
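interaction measure induced by $\theta$ and $\theta'$, respectively. Then, 

$$\begin{aligned}
\mathbb{E}_\theta[U_i(1)] &\ge \mathbb{E}_{\theta'}[U_i(1)] - \left(\frac{4n}{d} + 2\right)\sqrt{\frac{1}{2} D(\mathbb{P}, \mathbb{P}')} \\
&\ge \mathbb{E}_{\theta'}[U_i(1)] - \frac{\Delta}{2}\left(\frac{4n}{d} + 2\right)\sqrt{\mathbb{E}\left[\sum_{t=1}^{\tau_i} A_{ti}^2\right]} && (24.3) \\
&\ge \mathbb{E}_{\theta'}[U_i(1)] - \frac{\Delta}{2}\left(\frac{4n}{d} + 2\right)\sqrt{\frac{n}{d} + 1} && (24.4) \\
&\ge \mathbb{E}_{\theta'}[U_i(1)] - \frac{4\sqrt{3}\,n\Delta}{d}\sqrt{\frac{n}{d}}, && (24.5)
\end{aligned}$$

where in the first inequality we used Pinsker’s inequality (Eq. (14.12)), the result 
in Exercise 14.4, the bound 

$$U_i(1) = \sum_{t=1}^{\tau_i} \left(1/\sqrt{d} - A_{ti}\right)^2 \le 2\sum_{t=1}^{\tau_i} \frac{1}{d} + 2\sum_{t=1}^{\tau_i} A_{ti}^2 \le \frac{4n}{d} + 2,$$

and the assumption that $d \le 2n$. The inequality in Eq. (24.3) follows from the 
chain rule for the relative entropy up to a stopping time (Exercise 15.7). Eq. (24.4) 
is true by the definition of $\tau_i$ and Eq. (24.5) by the assumption that $d \le 2n$. 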
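Then, 

$$\begin{aligned}
\mathbb{E}_\theta[U_i(1)] + \mathbb{E}_{\theta'}[U_i(-1)] &\ge \mathbb{E}_{\theta'}[U_i(1) + U_i(-1)] - \frac{4\sqrt{3}\,n\Delta}{d}\sqrt{\frac{n}{d}} \\
&= 2\,\mathbb{E}_{\theta'}\left[\frac{\tau_i}{d} + \sum_{t=1}^{\tau_i} A_{ti}^2\right] - \frac{4\sqrt{3}\,n\Delta}{d}\sqrt{\frac{n}{d}} \\
&\ge \frac{2n}{d} - \frac{4\sqrt{3}\,n\Delta}{d}\sqrt{\frac{n}{d}} = \frac{n}{d}.
\end{aligned}$$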


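The proof is completed using the randomisation hammer: 

$$\begin{aligned}
\sum_{\theta \in \{\pm\Delta\}^d} R_n(\mathcal{A}, \theta) &\ge \frac{\Delta\sqrt{d}}{2} \sum_{i=1}^d \sum_{\theta \in \{\pm\Delta\}^d} \mathbb{E}_\theta[U_i(\operatorname{sign}(\theta_i))] \\
&= \frac{\Delta\sqrt{d}}{2} \sum_{i=1}^d \sum_{\theta_{-i} \in \{\pm\Delta\}^{d-1}} \sum_{\theta_i \in \{\pm\Delta\}} \mathbb{E}_\theta[U_i(\operatorname{sign}(\theta_i))] \\
&\ge \frac{\Delta\sqrt{d}}{2} \sum_{i=1}^d \sum_{\theta_{-i} \in \{\pm\Delta\}^{d-1}} \frac{n}{d} = 2^{d-2}\, n\Delta\sqrt{d}.
\end{aligned}$$

Hence there exists a $\theta \in \{\pm\Delta\}^d$ such that $R_n(\mathcal{A}, \theta) \ge \frac{n\Delta\sqrt{d}}{4} = \frac{d\sqrt{n}}{16\sqrt{3}}$. 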


The same proof works when $\mathcal{A} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ is the unit sphere. In 
fact, given a set $X \subset \mathbb{R}^d$, a minimax lower bound that holds for $\mathcal{A} = \operatorname{co}(X)$ 
continues to hold for $\mathcal{A} = X$. 
Sparse Parameter Vectors 


In Chapter 23 we gave an algorithm with $R_n = O(\sqrt{dpn})$ where $p \ge \|\theta\|_0$ is a 
known bound on the sparsity of the unknown parameter. Except for logarithmic 
terms this bound cannot be improved. An extreme case is when $p = 1$, which 
essentially reduces to the finite-armed bandit problem where the minimax regret 
has order $\sqrt{dn}$ (see Chapter 15). For this reason we cannot expect too much from 
sparsity and in particular the worst-case bound will depend polynomially on the 
ambient dimension d. 

Constructing a lower bound for p > 1 is relatively straightforward. For simplicity 
we assume that d = pk for some integer k > 1. A sparse linear bandit can mimic 
the learner playing p finite-armed bandits simultaneously, each with k arms. 
Rather than observing the reward for each bandit, however, the learner only 
observes the sum of the rewards and the noise is added at the end. This is 
sometimes called the multi-task bandit problem. 


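THEOREM 24.3. Assume $pd \le n$ and that $d = pk$ for some integer $k \ge 2$. Let 
$\mathcal{A} = \{e_i \in \mathbb{R}^k : i \in [k]\}^p \subset \mathbb{R}^d$. Then, for any policy there exists a parameter 
vector $\theta \in \mathbb{R}^d$ with $\|\theta\|_0 = p$ and $\|\theta\|_\infty \le \sqrt{d/(pn)}$ such that $R_n(\mathcal{A}, \theta) \ge \frac{1}{8}\sqrt{pdn}$. 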


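Proof Let $\Delta > 0$ and $\Theta = \{\Delta e_i : i \in [k]\} \subset \mathbb{R}^k$. Given $\theta \in \Theta^p \subset \mathbb{R}^d$ and $i \in [p]$, 
let $\theta^{(i)} \in \mathbb{R}^k$ be defined by $\theta^{(i)}_j = \theta_{(i-1)k+j}$ for $j \in [k]$, which means that 

$$\theta^\top = [\theta^{(1)\top}, \theta^{(2)\top}, \ldots, \theta^{(p)\top}].$$

Next define matrix $V \in \mathbb{R}^{p \times d}$ to be a block-diagonal matrix with $1 \times k$ blocks, 
each containing the row vector $(1, 2, \ldots, k)$. For example, when $p = 3$, we have 

$$V = \begin{pmatrix} 1 & \cdots & k & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 1 & \cdots & k & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 0 & \cdots & 0 & 1 & \cdots & k \end{pmatrix}.$$

Let $B_t = VA_t \in [k]^p$ represent the vector of ‘base’ actions chosen by the learner 
in each of the p bandits in round t. The optimal action in the ith bandit is 

$$b^*_i(\theta) = \operatorname{argmax}_{b \in [k]} \theta^{(i)}_b.$$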
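
The role of $V$ is easiest to see on a small instance. The sketch below builds $V$ with a Kronecker product and recovers the base actions from a concatenated one-hot action. The values of p and k and the chosen base arms are hypothetical.

```python
import numpy as np

p, k = 3, 4  # hypothetical: p base bandits with k arms each, so d = p * k

# Block-diagonal matrix with 1 x k blocks, each equal to (1, 2, ..., k).
row = np.arange(1, k + 1).reshape(1, k)
V = np.kron(np.eye(p, dtype=int), row)    # shape (p, p * k)

# An action in A is a concatenation of p standard basis vectors of R^k.
base_arms = [2, 4, 1]                     # arm chosen in each base bandit (1-indexed)
A_t = np.zeros(p * k, dtype=int)
for i, b in enumerate(base_arms):
    A_t[i * k + (b - 1)] = 1

B_t = V @ A_t                             # recovers the vector of base arms
print(V)
print(B_t)
```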


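The regret can be decomposed into the regrets in the p ‘base bandit’ problems (a 
form of separability, again): 

$$R_n(\theta) = \sum_{i=1}^p \underbrace{\Delta\, \mathbb{E}_\theta\left[\sum_{t=1}^n \mathbb{I}\{B_{ti} \neq b^*_i\}\right]}_{R_{ni}(\theta)}.$$

Writing $\theta^{(-i)}$ for $\theta$ with the ith block removed and averaging over $\theta \in \Theta^p$, 

$$\begin{aligned}
\frac{1}{|\Theta|^p} \sum_{\theta \in \Theta^p} R_n(\theta) &= \frac{1}{|\Theta|^p} \sum_{i=1}^p \sum_{\theta \in \Theta^p} R_{ni}(\theta) \\
&= \frac{1}{|\Theta|^{p-1}} \sum_{i=1}^p \sum_{\theta^{(-i)} \in \Theta^{p-1}} \frac{1}{|\Theta|} \sum_{\theta^{(i)} \in \Theta} R_{ni}(\theta) \\
&\ge \frac{1}{|\Theta|^{p-1}} \sum_{i=1}^p \sum_{\theta^{(-i)} \in \Theta^{p-1}} \frac{1}{8}\sqrt{kn} \\
&= \frac{1}{8}\, p\sqrt{kn} = \frac{1}{8}\sqrt{dpn}.
\end{aligned} \tag{24.6}$$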


Here, in the second equality, we use the convention that $\theta$ denotes the vector 
obtained by ‘inserting’ $\theta^{(i)}$ into $\theta^{(-i)}$ at the ith ‘block’. Other than this, the only 
tricky step is the inequality, which follows by choosing $\Delta = \sqrt{k/n}$ and repeating 
the argument outlined in Exercise 15.2. We leave it to the reader to check the 
details (Exercise 24.1). 


Misspecified Models 


An important generalisation of the linear model is the misspecified case, where 
the mean rewards are not assumed to follow a linear model exactly. Suppose 
that $\mathcal{A} \subset \mathbb{R}^d$ is a finite set with $|\mathcal{A}| = k$ and that $X_t = \eta_t + \mu(A_t)$, where 
$\mu : \mathcal{A} \to \mathbb{R}$ is an unknown function. Let $\theta \in \mathbb{R}^d$ be the parameter vector for which 
$\sup_{a \in \mathcal{A}} |\langle \theta, a \rangle - \mu(a)|$ is as small as possible: 

$$\theta = \operatorname{argmin}_{\theta' \in \mathbb{R}^d} \sup_{a \in \mathcal{A}} |\langle \theta', a \rangle - \mu(a)|.$$

Then let $\varepsilon = \sup_{a \in \mathcal{A}} |\langle \theta, a \rangle - \mu(a)|$ be the maximum error. It would be very 
pleasant to have an algorithm such that 

$$R_n(\mathcal{A}, \mu) = n \max_{a \in \mathcal{A}} \mu(a) - \mathbb{E}\left[\sum_{t=1}^n \mu(A_t)\right] = O\left(\min\{d\sqrt{n} + \varepsilon n, \sqrt{kn}\}\right). \tag{24.7}$$

Unfortunately, it turns out that results of this kind are not achievable. To show 
this, we will prove a generic bound for the classical finite-armed bandit problem 
and afterwards show how this implies the impossibility of an adaptive bound like 
the above. 


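THEOREM 24.4. Let $\mathcal{A} = [k]$, and for $\mu \in [0,1]^k$ the reward is $X_t = \mu_{A_t} + \eta_t$ and 
the regret is 

$$R_n(\mu) = n \max_{i \in \mathcal{A}} \mu_i - \mathbb{E}_\mu\left[\sum_{t=1}^n \mu_{A_t}\right].$$

Define $\Theta, \Theta' \subset \mathbb{R}^k$ by 

$$\Theta = \{\mu \in [0,1]^k : \mu_i = 0 \text{ for } i > 1\}, \qquad \Theta' = \{\mu \in [0,1]^k\}.$$

If $V \in \mathbb{R}$ is such that $2(k-1) \le V \le \sqrt{n(k-1)\exp(-2)/8}$ and $\sup_{\mu \in \Theta} R_n(\mu) \le 
V$, then 

$$\sup_{\mu' \in \Theta'} R_n(\mu') \ge \frac{n(k-1)}{8V}\exp(-2).$$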

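Proof Recall that $T_i(n) = \sum_{t=1}^n \mathbb{I}\{A_t = i\}$ is the number of times arm i is 
played after all n rounds. Let $\mu \in \Theta$ be given by $\mu_1 = \Delta = (k-1)/V \le 1/2$. The 
regret is then decomposed as: 

$$R_n(\mu) = \Delta \sum_{i=2}^k \mathbb{E}_\mu[T_i(n)] \le V.$$

Rearranging shows that $\sum_{i=2}^k \mathbb{E}_\mu[T_i(n)] \le V/\Delta$, and so by the pigeonhole principle 
there exists an $i > 1$ such that 

$$\mathbb{E}_\mu[T_i(n)] \le \frac{V}{(k-1)\Delta} = \frac{1}{\Delta^2}.$$

Then, define $\mu' \in \Theta'$ by 

$$\mu'_j = \begin{cases} \Delta & \text{if } j = 1 \\ 2\Delta & \text{if } j = i \\ 0 & \text{otherwise}. \end{cases}$$

Next, by Theorem 14.2 and Lemma 15.1, for any event A, we have 

$$\mathbb{P}_\mu(A) + \mathbb{P}_{\mu'}(A^c) \ge \frac{1}{2}\exp\left(-D(\mathbb{P}_\mu, \mathbb{P}_{\mu'})\right) = \frac{1}{2}\exp\left(-2\Delta^2\, \mathbb{E}_\mu[T_i(n)]\right) \ge \frac{1}{2}\exp(-2).$$

By choosing $A = \{T_1(n) \le n/2\}$ we have 

$$R_n(\mu) + R_n(\mu') \ge \frac{n\Delta}{4}\exp(-2) = \frac{n(k-1)}{4V}\exp(-2).$$

Therefore, by the assumption that $R_n(\mu) \le V \le \sqrt{n(k-1)\exp(-2)/8}$ we have 

$$R_n(\mu') \ge \frac{n(k-1)}{8V}\exp(-2).$$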

As promised, we now relate this to the misspecified linear bandits. Suppose 
that $d = 1$ (an absurd case) and that there are k arms $\mathcal{A} = \{a_1, a_2, \ldots, a_k\} \subset \mathbb{R}^1$, 
where $a_1 = (1)$ and $a_i = (0)$ for $i > 1$. Clearly, if $\theta > 0$ and $\mu(a_i) = \langle a_i, \theta \rangle$, then 
the problem can be modelled as a finite-armed bandit with means $\mu \in \Theta \subset [0,1]^k$. 
In the general case, we just have a finite-armed bandit with $\mu \in \Theta'$. If in the first 
case we have $R_n(\mathcal{A}, \mu) = O(\sqrt{n})$, then the theorem shows for large enough n that 

$$\sup_{\mu \in \Theta'} R_n(\mathcal{A}, \mu) = \Omega(k\sqrt{n}).$$
It follows that Eq. (24.7) is a pipe dream. To our knowledge, it is still an open 
question of what is possible on this front. We speculate that for $k \ge d^2$, there is 
a policy for which 

$$R_n(\mathcal{A}, \theta) = \tilde O\left(\min\left\{d\sqrt{n} + \varepsilon n\sqrt{d},\ \frac{k}{d}\sqrt{n}\right\}\right).$$


Notes 




The worst-case bound demonstrates the near optimality of the OFUL algorithm 
for a specific action set. It is an open question to characterise the optimal 
regret for a wide range of action sets. We will return to these issues in the next 
part of the book, where we discuss adversarial linear bandits. 

We return to misspecified bandits in the notes and exercises of Chapter 29, 
where algorithms from the adversarial linear bandit framework are applied to 
this problem in special cases. In many applications, the number of actions is so 
large that $R_n = O(d\sqrt{n} + \varepsilon n\sqrt{d})$ should be considered acceptable. There exist 
algorithms achieving this bound, which for large k is essentially not improvable 
in the worst case [Lattimore and Szepesvari, 2019b]. For small k, recent work by 
Foster and Rakhlin [2020] shows that one can achieve $R_n = O(\sqrt{dkn} + \varepsilon n\sqrt{k})$. 
Exercises 


24.1 Complete the missing steps to prove the inequality in Eq. (24.6). 
Asymptotic Lower Bounds for 
Stochastic Linear Bandits 


The lower bounds in the previous chapter were derived by analysing the worst 
case for specific action sets and/or constraints on the unknown parameter. In this 
chapter, we focus on the asymptotics and aim to understand the influence of the 
action set on the regret. We start with a lower bound, and argue that the lower 
bound can be achieved. We finish by arguing that the optimistic algorithms (and 
Thompson sampling) will perform arbitrarily worse than what can be achieved 
by non-optimistic algorithms. 


An Asymptotic Lower Bound for Fixed Action Sets 


We assume that $\mathcal{A} \subset \mathbb{R}^d$ is finite with $|\mathcal{A}| = k$ and that the reward is 
$X_t = \langle A_t, \theta \rangle + \eta_t$, where $\theta \in \mathbb{R}^d$ and $(\eta_t)_{t=1}^\infty$ is a sequence of independent 
standard Gaussian random variables. Of course the regret of a policy in this 
setting is 

$$R_n(\mathcal{A}, \theta) = \mathbb{E}_\theta\left[\sum_{t=1}^n \Delta_{A_t}\right], \qquad \Delta_a = \max_{a' \in \mathcal{A}} \langle a' - a, \theta \rangle,$$

where the dependence on the policy is omitted for readability and $\mathbb{E}_\theta[\cdot]$ is the 
expectation with respect to the measure on outcomes induced by the interaction 
of the policy and the linear bandit determined by $\theta$. Like the asymptotic lower 
bounds in the classical finite-armed case (Chapter 16), the results of this chapter 
are proven only for consistent policies. Recall that a policy is consistent in some 
class of bandits $\mathcal{E}$ if the regret is sub-polynomial for any bandit in that class. 
Here this means that 


R,(A,0) =o0(n?) for allp>O and 6€R®. (25.1) 
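
To make the definitions concrete, the following minimal Python sketch computes the gaps $\Delta_a$ and, for a given vector of pull counts, the realised sum of gaps (one term $\Delta_a$ per pull of action $a$). Taking the expectation of that sum over the randomness of the policy and the noise gives $R_n(\mathcal{A}, \theta)$. The action set, parameter and counts below are illustrative assumptions, not values from the text.

```python
def gaps(actions, theta):
    """Suboptimality gaps: Delta_a = max_{a'} <a' - a, theta>."""
    values = [sum(ai * ti for ai, ti in zip(a, theta)) for a in actions]
    best = max(values)
    return [best - v for v in values]


def regret_from_counts(actions, theta, counts):
    """Sum of gaps over all pulls, given how often each action was played."""
    return sum(c * g for c, g in zip(counts, gaps(actions, theta)))


# Hypothetical instance: three actions in R^2 and theta = (1, 0).
actions = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
theta = (1.0, 0.0)
deltas = gaps(actions, theta)

# Hypothetical pull counts after n = 100 rounds.
realised = regret_from_counts(actions, theta, [90, 4, 6])
print(deltas, realised)
```

A consistent policy must keep the expected value of this quantity below any polynomial in $n$, simultaneously for every $\theta$.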


The main objective of the chapter is to prove the following theorem on the 
behaviour of any consistent policy and discuss the implications. 


THEOREM 25.1. Assume that $\mathcal{A} \subset \mathbb{R}^d$ is finite and spans $\mathbb{R}^d$, and suppose a
policy is consistent (satisfies Eq. 25.1). Let $\theta \in \mathbb{R}^d$ be any parameter such
that there is a unique optimal action, and let $G_n = \mathbb{E}_\theta\left[\sum_{t=1}^n A_t A_t^\top\right]$. Then
$\liminf_{n\to\infty} \lambda_{\min}(G_n)/\log(n) > 0$. Furthermore, for any $a \in \mathcal{A}$, it holds that

$$\limsup_{n\to\infty} \log(n) \|a\|^2_{G_n^{-1}} \le \frac{\Delta_a^2}{2}.$$

The reader should recognise $\|a\|^2_{G_n^{-1}}$ as the key term in the width of the
confidence interval for the least squares estimator (Chapter 20). This is quite 
intuitive. The theorem is saying that any consistent algorithm must prove 
statistically that all suboptimal arms are indeed suboptimal by making the 
size of the confidence interval smaller than the suboptimality gap. Before the 
proof of this result, we give a corollary that characterises the asymptotic regret 
that must be endured by any consistent policy. 


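COROLLARY 25.2. Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$ and $\theta \in \mathbb{R}^d$ be such
that there is a unique optimal action. Then, for any consistent policy,

$$\liminf_{n\to\infty} \frac{R_n(\mathcal{A}, \theta)}{\log(n)} \ge c(\mathcal{A}, \theta),$$

where $c(\mathcal{A}, \theta)$ is defined as

$$c(\mathcal{A}, \theta) = \inf_{\alpha \in [0,\infty)^{\mathcal{A}}} \sum_{a \in \mathcal{A}} \alpha(a) \Delta_a
\quad \text{subject to} \quad \|a\|^2_{H_\alpha^{-1}} \le \frac{\Delta_a^2}{2} \text{ for all } a \in \mathcal{A} \text{ with } \Delta_a > 0,$$

with $H_\alpha = \sum_{a \in \mathcal{A}} \alpha(a) a a^\top$.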


The lower bound is complemented by a matching upper bound that we will 
not prove. 


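THEOREM 25.3. Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set that spans $\mathbb{R}^d$. Then there exists a
policy such that

$$\limsup_{n\to\infty} \frac{R_n(\mathcal{A}, \theta)}{\log(n)} \le c(\mathcal{A}, \theta),$$

where $c(\mathcal{A}, \theta)$ is defined as in Corollary 25.2.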


Proof of Theorem 25.1 The proof of the first part is simply omitted (see the
reference below for details). It follows along similar lines to what follows, essentially
that if $G_n$ is not sufficiently large in every direction, then some alternative
parameter is not sufficiently identifiable. Let $a^* = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, \theta\rangle$ be the optimal
action, which we assumed to be unique. Let $\theta' \in \mathbb{R}^d$ be an alternative parameter
to be chosen subsequently, and let $\mathbb{P}$ and $\mathbb{P}'$ be the measures on the sequence
of outcomes $A_1, X_1, \ldots, A_n, X_n$ induced by the interaction between the policy
and the bandit determined by $\theta$ and $\theta'$ respectively. Let $\mathbb{E}[\cdot]$ and $\mathbb{E}'[\cdot]$ be the
expectation operators of $\mathbb{P}$ and $\mathbb{P}'$, respectively. By Theorem 14.2 and Lemma 15.1,
for any event $E$,

$$\mathbb{P}(E) + \mathbb{P}'(E^c) \ge \frac{1}{2}\exp\left(-\mathrm{D}(\mathbb{P}, \mathbb{P}')\right)
= \frac{1}{2}\exp\left(-\frac{1}{2}\mathbb{E}\left[\sum_{t=1}^n \langle A_t, \theta - \theta'\rangle^2\right]\right)
= \frac{1}{2}\exp\left(-\frac{1}{2}\|\theta - \theta'\|^2_{G_n}\right). \tag{25.2}$$

A simple re-arrangement shows that

$$\frac{1}{2}\|\theta - \theta'\|^2_{G_n} \ge \log\left(\frac{1}{2\mathbb{P}(E) + 2\mathbb{P}'(E^c)}\right).$$


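Now we follow the usual plan of choosing $\theta'$ to be close to $\theta$, but so that the
optimal action in the bandit determined by $\theta'$ is not $a^*$. Let $\Delta_{\min} = \min\{\Delta_a :
a \in \mathcal{A}, \Delta_a > 0\}$ and $\varepsilon \in (0, \Delta_{\min})$ and $H$ be a positive definite matrix to be
chosen later such that $\|a - a^*\|^2_H > 0$. Then define

$$\theta' = \theta + \frac{\Delta_a + \varepsilon}{\|a - a^*\|^2_H} H(a - a^*),$$

which is chosen so that

$$\langle a - a^*, \theta'\rangle = \langle a - a^*, \theta\rangle + \Delta_a + \varepsilon = \varepsilon.$$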
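
The construction of the alternative parameter can be checked numerically. The sketch below builds $\theta'$ from the displayed formula and confirms that $\langle a - a^*, \theta'\rangle = \varepsilon$. The helper names and the particular matrix $H$, actions and $\varepsilon$ are hypothetical choices made only for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def alternative_parameter(theta, a, a_star, H, eps):
    """theta' = theta + (Delta_a + eps) / ||a - a*||_H^2 * H (a - a*)."""
    diff = [ai - si for ai, si in zip(a, a_star)]   # a - a*
    delta_a = -dot(diff, theta)                     # Delta_a = <a* - a, theta>
    H_diff = matvec(H, diff)
    norm_sq = dot(diff, H_diff)                     # ||a - a*||_H^2
    scale = (delta_a + eps) / norm_sq
    return [ti + scale * hi for ti, hi in zip(theta, H_diff)]


# Hypothetical two-dimensional instance.
theta = [1.0, 0.0]
a_star, a = [1.0, 0.0], [0.0, 1.0]
H = [[2.0, 0.5], [0.5, 1.0]]     # symmetric positive definite
eps = 0.1

theta_alt = alternative_parameter(theta, a, a_star, H, eps)
advantage = dot([ai - si for ai, si in zip(a, a_star)], theta_alt)
print(theta_alt, advantage)      # advantage of a over a* under theta' equals eps
```

Whatever admissible $H$ is used, the perturbation moves $\theta$ just far enough that $a$ beats $a^*$ by exactly $\varepsilon$; the choice of $H$ only controls the direction of the move.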


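This means that $a^*$ is $\varepsilon$-suboptimal for bandit $\theta'$. We abbreviate $R_n = R_n(\mathcal{A}, \theta)$
and $R_n' = R_n(\mathcal{A}, \theta')$. Then

$$R_n = \mathbb{E}\left[\sum_{a \in \mathcal{A}} T_a(n)\Delta_a\right]
\ge \frac{n\Delta_{\min}}{2}\mathbb{P}\left(T_{a^*}(n) < n/2\right)
\ge \frac{n\varepsilon}{2}\mathbb{P}\left(T_{a^*}(n) < n/2\right),$$

where $T_a(n) = \sum_{t=1}^n \mathbb{I}\{A_t = a\}$. Similarly, $a^*$ is $\varepsilon$-suboptimal in bandit $\theta'$ so that

$$R_n' \ge \frac{n\varepsilon}{2}\mathbb{P}'\left(T_{a^*}(n) \ge n/2\right).$$

Therefore,

$$\mathbb{P}\left(T_{a^*}(n) < n/2\right) + \mathbb{P}'\left(T_{a^*}(n) \ge n/2\right) \le \frac{2}{n\varepsilon}\left(R_n + R_n'\right). \tag{25.3}$$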


Note that this holds for any choice of $H$ with $\|a - a^*\|_H > 0$. The logical next
step is to select $H$ (which determines $\theta'$) to make (25.2) as large as possible. The
main difficulty is that this depends on $n$, so instead we aim to choose an $H$ so
the quantity is large enough infinitely often. We start by just re-arranging things:

$$\frac{1}{2}\|\theta - \theta'\|^2_{G_n}
= \frac{(\Delta_a + \varepsilon)^2}{2}\,\frac{\|a - a^*\|^2_{H G_n H}}{\|a - a^*\|^4_H}
= \frac{(\Delta_a + \varepsilon)^2}{2\|a - a^*\|^2_{G_n^{-1}}}\,\rho_n(H),$$

where we introduced

$$\rho_n(H) = \frac{\|a - a^*\|^2_{G_n^{-1}}\,\|a - a^*\|^2_{H G_n H}}{\|a - a^*\|^4_H}.$$

Therefore, by choosing $E$ to be the event that $T_{a^*}(n) < n/2$ and using (25.3) and
(25.2), we have

$$\frac{(\Delta_a + \varepsilon)^2}{2\|a - a^*\|^2_{G_n^{-1}}}\,\rho_n(H) \ge \log\left(\frac{n\varepsilon}{4R_n + 4R_n'}\right),$$

which after re-arrangement leads to

$$\frac{(\Delta_a + \varepsilon)^2}{2\log(n)\|a - a^*\|^2_{G_n^{-1}}}\,\rho_n(H) \ge 1 - \frac{\log((4R_n + 4R_n')/\varepsilon)}{\log(n)}.$$

The definition of consistency means that $R_n$ and $R_n'$ are both sub-polynomial,
which implies that the second term in the previous expression tends to zero for
large $n$ and so by sending $\varepsilon$ to zero,

$$\liminf_{n\to\infty} \frac{\rho_n(H)}{\log(n)\|a - a^*\|^2_{G_n^{-1}}} \ge \frac{2}{\Delta_a^2}. \tag{25.4}$$


We complete the result using proof by contradiction. Suppose that

$$\limsup_{n\to\infty} \log(n)\|a - a^*\|^2_{G_n^{-1}} > \frac{\Delta_a^2}{2}. \tag{25.5}$$

Then there exists an $\varepsilon > 0$ and infinite set $S \subseteq \mathbb{N}$ such that

$$\log(n)\|a - a^*\|^2_{G_n^{-1}} \ge \frac{(\Delta_a + \varepsilon)^2}{2} \quad \text{for all } n \in S.$$

Hence, by (25.4), $\liminf_{n \in S} \rho_n(H) > 1$. We now choose $H$ to be a cluster point of
the sequence $(G_n^{-1}/\|G_n^{-1}\|)_{n \in S}$ where $\|G_n^{-1}\|$ is the spectral norm of the matrix
$G_n^{-1}$. Such a point must exist, since matrices in this sequence have unit spectral
norm by definition and the set of such matrices is compact. We let $S' \subseteq S$ be
a subset so that $G_n^{-1}/\|G_n^{-1}\|$ converges to $H$ on $n \in S'$. We now check that
$\|a - a^*\|_H > 0$:

$$\|a - a^*\|^2_H = \lim_{n \in S'} \frac{\|a - a^*\|^2_{G_n^{-1}}}{\|G_n^{-1}\|} > 0,$$

where the last inequality follows from the assumption in (25.5) and the first part
of the theorem. Therefore,

$$1 < \liminf_{n \in S} \rho_n(H) \le \liminf_{n \in S'} \frac{\|a - a^*\|^2_{G_n^{-1}}\,\|a - a^*\|^2_{H G_n H}}{\|a - a^*\|^4_H} = 1,$$

which is a contradiction, and hence (25.5) does not hold. Thus,

$$\limsup_{n\to\infty} \log(n)\|a - a^*\|^2_{G_n^{-1}} \le \frac{\Delta_a^2}{2}.$$

We leave the proof of the corollary as an exercise for the reader. Essentially,
though, any consistent algorithm must choose its actions so that in expectation

$$\|a - a^*\|^2_{G_n^{-1}} \le (1 + o(1))\frac{\Delta_a^2}{2\log(n)}.$$

Now, since $a^*$ will be chosen linearly often, it is easily shown for suboptimal $a$
that $\lim_{n\to\infty} \|a - a^*\|_{G_n^{-1}}/\|a\|_{G_n^{-1}} = 1$. This leads to the required constraint on
the actions of the algorithm, and the optimisation problem in the corollary is
derived by minimising the regret subject to this constraint.


Clouds Looming for Optimism 


The theorem and its corollary have disturbing 
implications for policies based on the principle 
of optimism in the face of uncertainty, which 
is that they can never be asymptotically 
optimal. The reason is that these policies 
do not choose actions for which they have 
collected enough statistics to prove they are 
suboptimal, but in the linear setting it can 
be worth playing these actions when they 
are very informative about other actions for 
which the statistics are not yet so clear. As 
we shall see, a problematic example appears 
in the simplest case where there is information sharing between the arms. Namely, 
when the dimension is d = 2, and there are k = 3 arms. 


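Let $\mathcal{A} = \{a_1, a_2, a_3\}$, where $a_1 = e_1$ and $a_2 = e_2$ and $a_3 = (1 - \varepsilon, \gamma\varepsilon)$ with
$\gamma \ge 1$ and $\varepsilon > 0$ small. Let $\theta = (1, 0)$ so that the optimal action is $a^* = a_1$
and $\Delta_{a_2} = 1$ and $\Delta_{a_3} = \varepsilon$. If $\varepsilon$ is very small, then $a_1$ and $a_3$ point in nearly
the same direction, and so choosing only these arms does not provide sufficient
information to quickly learn which of $a_1$ or $a_3$ is optimal. On the other hand,
$a_2$ and $a_1 - a_3$ point in very different directions, which means that choosing $a_2$
allows a learning agent to quickly identify that $a_1$ is in fact optimal. We now
show how the theorem and corollary demonstrate this. First we calculate the
optimal solution to the optimisation problem in Corollary 25.2. Recall we are
trying to minimise

$$\sum_{a \in \mathcal{A}} \alpha(a)\Delta_a \quad \text{subject to} \quad \|a\|^2_{H(\alpha)^{-1}} \le \frac{\Delta_a^2}{2} \text{ for all } a \in \mathcal{A} \text{ with } \Delta_a > 0,$$

where $H(\alpha) = \sum_{a \in \mathcal{A}} \alpha(a) a a^\top$. Clearly we should choose $\alpha(a_1)$ arbitrarily large,
then a computation shows that

$$\lim_{\alpha(a_1)\to\infty} H(\alpha)^{-1} = \begin{bmatrix} 0 & 0 \\ 0 & \dfrac{1}{\alpha(a_3)\varepsilon^2\gamma^2 + \alpha(a_2)} \end{bmatrix}.$$

The constraints mean that

$$\frac{1}{\alpha(a_3)\varepsilon^2\gamma^2 + \alpha(a_2)} = \lim_{\alpha(a_1)\to\infty} \|a_2\|^2_{H(\alpha)^{-1}} \le \frac{1}{2},$$

$$\frac{\gamma^2\varepsilon^2}{\alpha(a_3)\varepsilon^2\gamma^2 + \alpha(a_2)} = \lim_{\alpha(a_1)\to\infty} \|a_3\|^2_{H(\alpha)^{-1}} \le \frac{\varepsilon^2}{2}.$$

Provided that $\gamma \ge 1$, the second constraint implies the first, and so this reduces to
the constraint that

$$\alpha(a_3)\varepsilon^2\gamma^2 + \alpha(a_2) \ge 2\gamma^2.$$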
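
The limiting constraints can be checked numerically. The following sketch builds $H(\alpha)$ for the three-armed example with a very large $\alpha(a_1)$ and the allocation $\alpha(a_2) = 2\gamma^2$, $\alpha(a_3) = 0$, then evaluates $\|a\|^2_{H(\alpha)^{-1}}$ for each arm. The numerical values of $\gamma$, $\varepsilon$ and $\alpha(a_1)$ are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def design_matrix(alpha, actions):
    """H(alpha) = sum_a alpha(a) a a^T for two-dimensional actions."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for w, a in zip(alpha, actions):
        for i in range(2):
            for j in range(2):
                H[i][j] += w * a[i] * a[j]
    return H


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def squared_norm(v, M):
    """||v||_M^2 = v^T M v."""
    return sum(v[i] * M[i][j] * v[j] for i in range(2) for j in range(2))


gamma, eps = 2.0, 0.01
actions = [(1.0, 0.0), (0.0, 1.0), (1.0 - eps, gamma * eps)]
gaps = [0.0, 1.0, eps]

# alpha(a_1) is "arbitrarily large"; alpha(a_2) = 2 gamma^2; alpha(a_3) = 0.
alpha = [1e9, 2 * gamma**2, 0.0]
H_inv = inverse_2x2(design_matrix(alpha, actions))
norms = [squared_norm(a, H_inv) for a in actions]
cost = sum(w * g for w, g in zip(alpha, gaps))

print(norms[1], "<=", gaps[1] ** 2 / 2)   # constraint for a_2 (slack)
print(norms[2], "vs", gaps[2] ** 2 / 2)   # constraint for a_3 (tight in the limit)
print(cost)                               # objective value 2 * gamma^2
```

With a finite $\alpha(a_1)$ the constraint for $a_3$ is met only up to a term of order $1/\alpha(a_1)$, which vanishes in the limit. The optimal arm $a_1$ has zero gap, so making $\alpha(a_1)$ large costs nothing in the objective.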


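Since we are minimising $\alpha(a_2) + \varepsilon\alpha(a_3)$ we can easily see that $\alpha(a_2) = 2\gamma^2$ and
$\alpha(a_3) = 0$ provided that $2\gamma^2 \le 2/\varepsilon$. Therefore, if $\varepsilon$ is chosen sufficiently small
relative to $\gamma$, then the optimal rate of the regret is $c(\mathcal{A}, \theta) = 2\gamma^2$, and so by
Theorem 25.3 there exists a policy such that

$$\limsup_{n\to\infty} \frac{R_n(\mathcal{A}, \theta)}{\log(n)} \le 2\gamma^2.$$

Now we argue that, for $\gamma$ sufficiently large and $\varepsilon$ arbitrarily small, the regret
of any consistent optimistic algorithm satisfies

$$\limsup_{n\to\infty} \frac{R_n(\mathcal{A}, \theta)}{\log(n)} = \Omega(1/\varepsilon),$$

which can be arbitrarily worse than the optimal rate! So why is this so? Recall
that optimistic algorithms choose

$$A_t = \operatorname{argmax}_{a \in \mathcal{A}} \max_{\tilde\theta \in \mathcal{C}_t} \langle a, \tilde\theta\rangle,$$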
where $\mathcal{C}_t \subset \mathbb{R}^d$ is a confidence set that we assume contains the true $\theta$ with high
probability. So far this does not greatly restrict the class of algorithms that we
might call optimistic. We now assume that there exists a constant $c > 0$ such
that

$$\mathcal{C}_t \subseteq \left\{\tilde\theta : \|\hat\theta_t - \tilde\theta\|_{V_t} \le c\sqrt{\log(n)}\right\},$$

where $V_t = \sum_{s=1}^t A_s A_s^\top$. So now we ask how often we can expect the optimistic
algorithm to choose action $a_2 = e_2$ in the example described above. Since we
have assumed $\theta \in \mathcal{C}_t$ with high probability, we have that

$$\max_{\tilde\theta \in \mathcal{C}_t} \langle a_1, \tilde\theta\rangle \ge 1.$$

On the other hand, if $T_{a_2}(t - 1) > 4c^2\log(n)$, then

$$\max_{\tilde\theta \in \mathcal{C}_t} \langle a_2, \tilde\theta\rangle
= \max_{\tilde\theta \in \mathcal{C}_t} \langle a_2, \tilde\theta - \theta\rangle
\le 2c\|a_2\|_{V_t^{-1}}\sqrt{\log(n)}
\le 2c\sqrt{\frac{\log(n)}{T_{a_2}(t - 1)}} < 1,$$

which means that $a_2$ will not be chosen more than $1 + 4c^2\log(n)$ times. So if
$\gamma = \Omega(c^2)$, then the optimistic algorithm will not choose $a_2$ sufficiently often
and a simple computation shows it must choose $a_3$ at least $\Omega(\log(n)/\varepsilon^2)$ times
and suffers regret of $\Omega(\log(n)/\varepsilon)$. The key take away from this is that optimistic
algorithms do not choose actions that are statistically suboptimal, but for linear
bandits it can be optimal to choose these actions more often to gain information
about other actions.
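
The counting argument can be turned into a rough back-of-the-envelope calculation. The sketch below is heuristic: it treats expected pull counts as deterministic, takes the cap $1 + 4c^2\log(n)$ on pulls of $a_2$ from the text, and assumes the limiting requirement $T_{a_3}\varepsilon^2\gamma^2 + T_{a_2} \ge 2\gamma^2\log(n)$ suggested by Theorem 25.1, ignoring lower-order terms. The function name and all numerical values are hypothetical.

```python
import math


def optimistic_regret_estimate(c, gamma, eps, n):
    """Heuristic lower estimate of the regret of an optimistic policy.

    Returns (cap on pulls of a_2, pulls of a_3 needed, regret from a_3).
    """
    log_n = math.log(n)
    t2_cap = 1 + 4 * c**2 * log_n            # optimism stops playing a_2 here
    information_needed = 2 * gamma**2 * log_n
    # The shortfall must be covered by a_3, each pull contributing eps^2 gamma^2.
    t3_needed = max(0.0, (information_needed - t2_cap) / (eps**2 * gamma**2))
    return t2_cap, t3_needed, eps * t3_needed


c, gamma, eps, n = 1.0, 4.0, 0.01, 10**6
t2_cap, t3_needed, regret = optimistic_regret_estimate(c, gamma, eps, n)
optimal_rate = 2 * gamma**2 * math.log(n)    # c(A, theta) * log(n)

print(round(t2_cap), round(t3_needed), round(regret), round(optimal_rate))
```

Under these assumptions the regret from $a_3$ alone scales like $\log(n)/\varepsilon$, while the allocation that keeps playing $a_2$ pays only about $2\gamma^2\log(n)$.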


This conclusion generalises to structured bandit problems where choosing 
one action allows you to gain information about the rewards of other actions. 
In such models the optimism principle often provides basic guarantees, but 
may fail to optimally exploit the structure of the problem. 


Notes 


1 All algorithms known to achieve the upper bound in Theorem 25.3, and hence to
match the lower bound in Corollary 25.2, are based on (or inspired by) solving
the optimisation problem that defines $c(\mathcal{A}, \theta)$ with $\theta$ replaced by an
estimate. Unfortunately, these algorithms are not especially practical
in finite time. As far as we know, none are simultaneously near-optimal in a
minimax sense. Constructing a practical asymptotically optimal algorithm for
linear bandits is a fascinating open problem.

2 In Chapter 36 we will introduce the randomised Bayesian algorithm called
Thompson sampling for finite-armed and linear bandits. While
Thompson sampling is often empirically superior to UCB, it does not overcome
the issues described here.


The theorems of this chapter are by the authors: Lattimore and Szepesvari [2017].
The example in Section 25.2 first appeared in a paper by Soare et al. [2014], 
which deals with the problem of best-arm identification for linear bandits (for an 
introduction to best-arm identification, see Chapter 33). The optimisation-based 
algorithms that match the lower bound are by Lattimore and Szepesvari [2017], 
Ok et al. [2018], Combes et al. [2017] and Hao et al. [2020], with the latter 
handling also the contextual case with finitely many contexts. 


Exercises 


25.1 Prove Corollary 25.2. 


25.2 Prove the first part of Theorem 25.1.


25.3 Give examples of action sets $\mathcal{A}$, parameter vectors $\theta \in \mathbb{R}^d$ and vectors
$a \in \mathbb{R}^d$ such that:

(a) $c(\mathcal{A} \cup \{a\}, \theta) > c(\mathcal{A}, \theta)$; and

(b) $c(\mathcal{A} \cup \{a\}, \theta) < c(\mathcal{A}, \theta)$.


Part VI


The adversarial linear bandit is superficially a generalisation of the stochastic
linear bandit where the unknown parameter vector is chosen by an adversary. 
There are many similarities between the two topics. Indeed, the techniques in 
this part combine the ideas of optimal design presented in Chapter 22 with 
the exponential weighting algorithm of Chapter 11. The intuitions gained by 
studying stochastic bandits should not be taken too seriously, however. There are 
subtle differences between the model of adversarial bandits introduced here and 
the stochastic linear bandits examined in previous chapters. These differences 
will be discussed at length in Chapter 29. The adversarial version of the linear 
bandits turns out to be remarkably rich, both because of the complex information 
structure and because of the challenging computational issues. 

The part is split into four chapters, the first of which is an introduction to the 
necessary tools from convex analysis and optimisation. In the first chapter on 
bandits, we show how to combine the core ideas of the Exp3 policy of Chapter 11 
with the optimal experimental design for least-squares estimators in Chapter 21. 
When the number of actions is large (or infinite), the approach based on Exp3 
is hard to make efficient. These shortcomings are addressed in the next chapter, 
where we introduce the mirror descent and follow-the-regularised leader algorithms 
for bandits and show how they can be used to design efficient algorithms. We 
conclude the part with a discussion on the relationship between adversarial and 
stochastic linear bandits, which is more subtle than the situation with finite-armed 
bandits.


Foundations of Convex Analysis


Our coverage of convex analysis is necessarily extremely brief. We introduce only 
what is necessary and refer the reader to standard texts for the proofs. 


Convex Sets and Functions 


A set $A \subseteq \mathbb{R}^d$ is convex if for any $x, y \in A$ it holds that $\alpha x + (1 - \alpha)y \in A$ for all
$\alpha \in (0, 1)$. The convex hull of a collection of points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ is the
smallest convex set containing the points, which also happens to satisfy

$$\operatorname{co}(x_1, x_2, \ldots, x_n) = \left\{x \in \mathbb{R}^d : x = \sum_{i=1}^n p_i x_i \text{ for some } p \in \mathcal{P}_{n-1}\right\}.$$

The convex hull $\operatorname{co}(A)$ is also defined for an arbitrary set $A \subset \mathbb{R}^d$ and is still the
smallest convex set that contains $A$ (see (c) in Figure 26.1). For the rest of the
section, we let $A \subseteq \mathbb{R}^d$ be convex. Let $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, \infty\}$ be the extended real
number system and define operations involving infinities in the natural way (see
notes).


DEFINITION 26.1. An extended real-valued function $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is convex if its
epigraph $E_f = \{(x, y) \in \mathbb{R}^d \times \mathbb{R} : y \ge f(x)\} \subset \mathbb{R}^{d+1}$ is a convex set.


The prefix ‘epi’ comes from Greek and means upon or over: the epigraph of
a function is the set of points that sit on top of the function’s graph.

The domain of an extended real-valued function on $\mathbb{R}^d$ is $\operatorname{dom}(f) = \{x \in \mathbb{R}^d :
f(x) < \infty\}$. For $S \subset \mathbb{R}^d$, a function $f : S \to \bar{\mathbb{R}}$ is identified with the function
$\bar f : \mathbb{R}^d \to \bar{\mathbb{R}}$, which coincides with $f$ on $S$ and is defined to take the value $\infty$
outside of $S$. It follows that if $f : S \to \mathbb{R}$, then $\operatorname{dom}(\bar f) = S$. A convex function is
proper if its range does not include $-\infty$ and its domain is nonempty.


For the rest of the chapter, we will write ‘let $f$ be convex’ to mean that
$f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is a proper convex function.


Figure 26.1 (a) is a convex set. (b) is a non-convex set. (c) is the convex hull of a
non-convex set. (d) is a convex function. (e) is non-convex, but all local minimums are
global. (f) is not convex.

Permitting convex functions to take values of $-\infty$ is a convenient standard
because certain operations on proper convex functions result in improper 
ones (infimal convolution, for example). These technicalities will never bother 
us in this book, however. 


A consequence of the definition is that for convex $f$, we have

$$f(\alpha x + (1 - \alpha)y) \le \alpha f(x) + (1 - \alpha)f(y)
\quad \text{for all } \alpha \in (0, 1) \text{ and } x, y \in \operatorname{dom}(f). \tag{26.1}$$

In fact, the inequality holds for all $x, y \in \mathbb{R}^d$.


Some authors use Eq. (26.1) as the definition of a convex function along
with a specification that the domain is convex: If $A \subseteq \mathbb{R}^d$ is convex, then
$f : A \to \mathbb{R}$ is convex if it satisfies Eq. (26.1), with $f(x) = \infty$ assumed for
$x \notin A$.


The reader is invited to prove that all convex functions are continuous on the
interior of their domain (Exercise 26.1).

A function is strictly convex if the inequality in Eq. (26.1) is always strict.
The Fenchel dual of a function $f$ is $f^*(u) = \sup_x \langle x, u\rangle - f(x)$, which is convex
because the pointwise supremum of convex functions is convex. The Fenchel dual has many
nice properties. Most important for us is that for sufficiently nice functions, $\nabla f^*$
is the inverse of $\nabla f$ (Theorem 26.6). Another useful property is that when $f$ is
a proper convex function and its epigraph is closed, then $f = f^{**}$, where $f^{**}$
denotes the bidual of $f$: $f^{**} = (f^*)^*$. The Fenchel dual is also called the convex
conjugate. If $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is twice differentiable on the interior of its domain, then
convexity of $f$ is equivalent to its Hessian having non-negative eigenvalues for
all $x \in \operatorname{int}(\operatorname{dom}(f))$. The field of optimisation is obsessed with convex functions
because all local minimums are global (see Fig. 26.1). This means that minimising
a convex function is usually possible (efficiently) using some variation of gradient
descent. A function $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is concave if $-f$ is convex.
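
The definition of the Fenchel dual can be explored numerically in one dimension. The sketch below approximates the supremum by a maximum over a finite grid (an approximation introduced here, valid only when the maximiser lies inside the grid) and applies it to $f(x) = x^2/2$, whose dual is again $u^2/2$. Applying the same routine twice illustrates $f = f^{**}$. The function names and the grid are illustrative choices.

```python
def fenchel_dual(f, u, grid):
    """Approximate f*(u) = sup_x (x * u - f(x)) by a maximum over a grid."""
    return max(x * u - f(x) for x in grid)


def quadratic(x):
    return 0.5 * x * x


# Grid on [-5, 5] with step 0.01; fine for |u| well inside the range.
grid = [i / 100.0 for i in range(-500, 501)]

dual_at_u = fenchel_dual(quadratic, 1.5, grid)       # expect 0.5 * 1.5**2


def dual_of_quadratic(u):
    return fenchel_dual(quadratic, u, grid)


bidual_at_x = fenchel_dual(dual_of_quadratic, 1.0, grid)   # expect quadratic(1.0)

print(dual_at_u, bidual_at_x)
```

For functions whose supremum is attained outside the grid, or not attained at all, the grid maximum underestimates $f^*(u)$, so this is a teaching device rather than a general-purpose routine.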


Jensen’s Inequality 


One of the most important results for convex functions is Jensen’s inequality: 


THEOREM 26.2 (Jensen’s inequality). Let $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ be a measurable convex
function and $X$ be an $\mathbb{R}^d$-valued random element on some probability space such
that $\mathbb{E}[X]$ exists and $X \in \operatorname{dom}(f)$ holds almost surely. Then $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$.


If we allowed Lebesgue integrals to take on the value of $\infty$, the condition that
$X$ is almost surely an element of the domain of $f$ could be removed and the
result would still be true. Indeed, in this case we would immediately conclude
that $\mathbb{E}[f(X)] = \infty$ and Jensen’s inequality would trivially hold.

The basic inequality of (26.1) is trivially a special case of Jensen’s inequality.
Jensen’s inequality is so central to convexity that it can actually be used as the
definition (a function is convex if and only if it satisfies Jensen’s inequality).
The proof of Jensen’s inequality using Definition 26.1 in full generality is left to
the reader (Exercise 26.2). However, we cannot resist including here a simple
‘graphical proof’ that works in the simple case when $X$ is supported
on $x_1, \ldots, x_n$ and $\mathbb{P}(X = x_k) = p_k$. Then, letting $\bar x = \sum_k p_k x_k$, one can notice
that the point $(\bar x, \sum_k p_k f(x_k))$ lies in the convex hull of $\{(x_k, f(x_k))\}_k$, which is a
convex subset of the epigraph $E_f \subset \mathbb{R}^{d+1}$. The result follows because $(\bar x, f(\bar x))$ is
on the boundary of $E_f$ as shown in the figure. The direction of Jensen’s inequality
is reversed if ‘convex’ is replaced by ‘concave’.

[Figure: graph of a convex function with support points $x_1, \ldots, x_5$ on the horizontal axis and the point $(\bar x, f(\bar x))$ marked.]
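
Jensen's inequality is easy to check numerically for a finitely supported random variable, which is the setting of the graphical proof. The support points and probabilities below are arbitrary illustrative values.

```python
import math


def expectation(values, probs):
    return sum(p * v for p, v in zip(probs, values))


# Hypothetical finitely supported random variable X.
support = [0.5, 1.0, 2.0, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

mean = expectation(support, probs)

# Convex function f(x) = x^2: E[f(X)] >= f(E[X]).
lhs_convex = expectation([x * x for x in support], probs)
rhs_convex = mean * mean

# Concave function log: the inequality is reversed.
lhs_concave = expectation([math.log(x) for x in support], probs)
rhs_concave = math.log(mean)

print(lhs_convex, ">=", rhs_convex)
print(lhs_concave, "<=", rhs_concave)
```

The gap between the two sides for $f(x) = x^2$ is exactly the variance of $X$, which is one way to remember the direction of the inequality.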


Bregman Divergence 


Let $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ be convex and $x, y \in \mathbb{R}^d$ with $y \in \operatorname{dom}(f)$. The Bregman
divergence at $y$ induced by $f$ is defined by

$$D_f(x, y) = f(x) - f(y) - \nabla_{x-y} f(y),$$

where $\nabla_v f(y) = \lim_{h \to 0+} (f(y + hv) - f(y))/h \in \mathbb{R} \cup \{-\infty, \infty\}$ is the directional
derivative of $f$ at $y$ in direction $v$. The directional derivative is always well defined
for convex functions, but can be positive/negative infinity. When $f$ is differentiable
at $y$, then $\nabla_v f(y) = \langle v, \nabla f(y)\rangle$ and thus $D_f(x, y) = f(x) - f(y) - \langle x - y, \nabla f(y)\rangle$,
which is the more usual definition. For the geometric intuition see Fig. 26.2. Let
$\operatorname{dom}(\nabla f)$ denote the set of points in the domain of $f$ where $f$ is differentiable.


Figure 26.2 The Bregman divergence $D_f(x, y)$ is the difference between $f(x)$ and the
Taylor series approximation of $f$ at $y$, namely $f(y) + \langle x - y, \nabla f(y)\rangle$. When $f$ is
convex, the linear approximation is a lower bound on the function and the Bregman
divergence is non-negative.


THEOREM 26.3. The following hold: 


(a) $D_f(x, y) \ge 0$ for all $y \in \operatorname{dom}(f)$.
(b) $D_f(x, x) = 0$ for all $x \in \operatorname{dom}(f)$.
(c) $D_f(x, y)$ is convex as a function of $x$ for any $y \in \operatorname{dom}(\nabla f)$.


Part (c) does not hold in general when f is not differentiable at y, as you will 
show in Exercise 26.14. The square root of the Bregman divergence shares many 
properties with a metric, and for some choices of f, it actually is a metric. In 
general, however, it is not symmetric and does not satisfy the triangle inequality. 


EXAMPLE 26.4. Let $f(x) = \frac{1}{2}\|x\|_2^2$. Then $\nabla f(x) = x$ and

$$D_f(x, y) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle x - y, y\rangle = \frac{1}{2}\|x - y\|_2^2.$$


EXAMPLE 26.5. Let $A = [0, \infty)^d$, $\operatorname{dom}(f) = A$ and for $x \in A$, $f(x) =
\sum_{i=1}^d (x_i\log(x_i) - x_i)$, where $0\log(0) = 0$. Then, for $y \in (0, \infty)^d$, $\nabla f(y) = \log(y)$
and

$$D_f(x, y) = \sum_{i=1}^d (x_i\log(x_i) - x_i) - \sum_{i=1}^d (y_i\log(y_i) - y_i) - \sum_{i=1}^d \log(y_i)(x_i - y_i)
= \sum_{i=1}^d x_i\log\left(\frac{x_i}{y_i}\right) + \sum_{i=1}^d (y_i - x_i).$$

Notice that if $x, y \in \mathcal{P}_{d-1}$ are in the unit simplex, then $D_f(x, y)$ is the relative
entropy between probability vectors $x$ and $y$. The function $f$ is called the
unnormalised negentropy, which will feature heavily in many of the chapters
that follow. When $y \not> 0$, the Bregman divergence is infinite if there exists an
$i$ such that $y_i = 0$ and $x_i > 0$. Otherwise, $D_f(x, y) = \sum_{i : x_i > 0} x_i\log(x_i/y_i) +
\sum_{i=1}^d (y_i - x_i)$.
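
Both examples can be reproduced with a generic routine that evaluates $f(x) - f(y) - \langle x - y, \nabla f(y)\rangle$ for a differentiable $f$. The sketch below assumes $y$ has strictly positive coordinates in the negentropy case, and the test vectors are arbitrary probability vectors chosen for illustration.

```python
import math


def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <x - y, grad f(y)> for differentiable f."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_f(y), x, y))
    return f(x) - f(y) - inner


def half_squared_norm(x):
    return 0.5 * sum(xi * xi for xi in x)


def identity(x):
    return list(x)


def negentropy(x):
    # Unnormalised negentropy with the convention 0 * log(0) = 0.
    return sum((xi * math.log(xi) - xi) if xi > 0 else 0.0 for xi in x)


def log_gradient(y):
    return [math.log(yi) for yi in y]      # requires y > 0 componentwise


x = [0.2, 0.3, 0.5]
y = [0.5, 0.25, 0.25]

euclidean = bregman(half_squared_norm, identity, x, y)
relative_entropy = bregman(negentropy, log_gradient, x, y)
kl = sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))

print(euclidean, relative_entropy, kl)
```

Because both vectors sum to one, the term $\sum_i (y_i - x_i)$ vanishes and the negentropy divergence coincides with the relative entropy.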


Legendre Functions 


In this section we use various topological notions such as the interior, closed set and
boundary. The definitions of these terms are given in the notes. Let $f$ be a convex function
and $A = \operatorname{dom}(f)$ and $C = \operatorname{int}(A)$. Then $f$ is Legendre if

(a) $C$ is non-empty;

(b) $f$ is differentiable and strictly convex on $C$; and

(c) $\lim_{n\to\infty} \|\nabla f(x_n)\|_2 = \infty$ for any sequence $(x_n)_n$ with $x_n \in C$ for all $n$
and $\lim_{n\to\infty} x_n = x$ and some $x \in \partial C$.


Figure 26.3 $f(x) = -\sqrt{x}$: the archetypical Legendre function


The intuition is that the set $\{(x, f(x)) : x \in \operatorname{dom}(f)\}$ is a ‘dish’ with ever-steepening
edges towards the boundary of the domain. Legendre functions have
some very convenient properties:


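THEOREM 26.6. Let $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ be a Legendre function. Then,


(a) $\nabla f$ is a bijection between $\operatorname{int}(\operatorname{dom}(f))$ and $\operatorname{int}(\operatorname{dom}(f^*))$ with the inverse
$(\nabla f)^{-1} = \nabla f^*$;

(b) $D_f(x, y) = D_{f^*}(\nabla f(y), \nabla f(x))$ for all $x, y \in \operatorname{int}(\operatorname{dom}(f))$; and

(c) the Fenchel conjugate $f^*$ is Legendre.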
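
Parts (a) and (b) can be checked numerically for the unnormalised negentropy $f(x) = \sum_i (x_i\log(x_i) - x_i)$. The sketch below relies on the standard fact, not derived in this section, that its conjugate is $f^*(u) = \sum_i \exp(u_i)$, so that $\nabla f = \log$ and $\nabla f^* = \exp$ componentwise. The test points are arbitrary positive vectors.

```python
import math


def bregman(f, grad_f, x, y):
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_f(y), x, y))
    return f(x) - f(y) - inner


def negentropy(x):
    return sum(xi * math.log(xi) - xi for xi in x)      # x > 0 assumed


def grad_negentropy(x):
    return [math.log(xi) for xi in x]


def conjugate(u):
    # Assumed closed form of the Fenchel dual of the unnormalised negentropy.
    return sum(math.exp(ui) for ui in u)


def grad_conjugate(u):
    return [math.exp(ui) for ui in u]


x = [0.5, 2.0, 1.5]
y = [1.0, 0.25, 3.0]

# (a) grad f* inverts grad f on the interior of the domain.
round_trip = grad_conjugate(grad_negentropy(x))

# (b) D_f(x, y) = D_{f*}(grad f(y), grad f(x)); note the swapped arguments.
primal = bregman(negentropy, grad_negentropy, x, y)
dual = bregman(conjugate, grad_conjugate, grad_negentropy(y), grad_negentropy(x))

print(round_trip, primal, dual)
```

The swap of arguments in part (b) is easy to get wrong: evaluating the dual divergence at $(\nabla f(x), \nabla f(y))$ instead gives $D_f(y, x)$, which generally differs from $D_f(x, y)$.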


The next result formalises the ‘dish’ intuition by showing the directional 
derivative along any straight path from a point in the interior to the boundary 
blows up. You should supply the proof of the following results in Exercise 26.6. 


PROPOSITION 26.7. Let $f$ be Legendre and $x \in \operatorname{int}(\operatorname{dom}(f))$ and $y \in
\partial\operatorname{int}(\operatorname{dom}(f))$, then $\lim_{\alpha \to 1} \langle y - x, \nabla f((1 - \alpha)x + \alpha y)\rangle = \infty$.


COROLLARY 26.8. If $f$ is Legendre and $x^* \in \operatorname{argmin}_{x \in \operatorname{dom}(f)} f(x)$, then $x^* \in
\operatorname{int}(\operatorname{dom}(f))$.


EXAMPLE 26.9. Let $f$ be the Legendre function given by $f(x) = \frac{1}{2}\|x\|_2^2$, which
has domain $\operatorname{dom}(f) = \mathbb{R}^d$. Then, $f^*(x) = f(x)$ and $\nabla f$ and $\nabla f^*$ are the identity
functions.


EXAMPLE 26.10. Let $f(x) = -2\sum_{i=1}^d \sqrt{x_i}$ when $x_i \ge 0$ for all $i$ and $\infty$
otherwise, which has $\operatorname{dom}(f) = [0, \infty)^d$ and $\operatorname{int}(\operatorname{dom}(f)) = (0, \infty)^d$. The gradient
is $\nabla f(x) = -1/\sqrt{x}$, which blows up (in norm) on any sequence $(x_n)$ approaching
$\partial\operatorname{int}(\operatorname{dom}(f)) = \{x \in [0, \infty)^d : x_i = 0 \text{ for some } i \in [d]\}$. Here, $\sqrt{x}$ stands for
the vector $(\sqrt{x_i})_i$. In what follows we will often use the underlying convention
of extending univariate functions to vectors by applying them componentwise.
Note that $\|\nabla f(x)\| \to 0$ as $\|x\| \to \infty$: ‘$\infty$’ is not part of the boundary of $\operatorname{dom}(f)$.
Strict convexity is also obvious so $f$ is Legendre. In Exercise 26.8, we ask you
to calculate the Bregman divergences with respect to $f$ and $f^*$ and verify the
results of Theorem 26.6.


EXAMPLE 26.11. Let $f(x) = \sum_i (x_i\log(x_i) - x_i)$ be the unnormalised negentropy,
which we met in Example 26.5. Similarly to the previous example, $\operatorname{dom}(f) =
[0, \infty)^d$, $\operatorname{int}(\operatorname{dom}(f)) = (0, \infty)^d$ and $\partial\operatorname{int}(\operatorname{dom}(f)) = \{x \in [0, \infty)^d : x_i =
0 \text{ for some } i \in [d]\}$. The gradient is $\nabla f(x) = \log(x)$, and thus $\|\nabla f(x)\| \to \infty$ as
$x \to \partial\operatorname{int}(\operatorname{dom}(f))$. Strict convexity also holds, hence $f$ is Legendre. You already
met the Bregman divergence $D_f(x, y)$, which turned out to be the relative entropy
when $x, y$ belong to the simplex. Exercise 26.9 asks you to calculate the dual of $f$
(can you guess what this function will be?) and the Bregman divergence induced
by $f^*$ and to verify Theorem 26.6.


The Taylor series of the Bregman divergence is often a useful approximation.
Let $g(y) = D_f(x, y)$, which for $y = x$ has $\nabla g(y) = 0$ and $\nabla^2 g(y) = \nabla^2 f(x)$. A
second-order Taylor expansion suggests that

$$D_f(x, y) = g(y) \approx g(x) + \langle y - x, \nabla g(x)\rangle + \frac{1}{2}\|y - x\|^2_{\nabla^2 g(x)} = \frac{1}{2}\|y - x\|^2_{\nabla^2 f(x)}.$$


This approximation can be very poor if $x$ and $y$ are far apart. Even when $x$ and $y$
are close, the lower-order terms are occasionally problematic, but nevertheless the
approximation can guide intuition. The next theorem, which is based on Taylor’s
theorem and measurable selections, gives an exact result (Exercise 26.15).


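THEOREM 26.12. If $f$ is convex and twice differentiable in $A = \operatorname{int}(\operatorname{dom}(f))$ and
$x, y \in A$, then there exists an $\alpha \in [0, 1]$ and $z = \alpha x + (1 - \alpha)y$ such that

$$D_f(x, y) = \frac{1}{2}(x - y)^\top \nabla^2 f(z)(x - y). \tag{26.2}$$

Suppose furthermore that $\nabla^2 f$ is continuous on $\operatorname{int}(\operatorname{dom}(f))$; then there exists a
measurable function $g : \operatorname{int}(\operatorname{dom}(f)) \times \operatorname{int}(\operatorname{dom}(f)) \to \operatorname{int}(\operatorname{dom}(f))$ such that for
all $x, y \in \operatorname{int}(\operatorname{dom}(f))$,

$$D_f(x, y) = \frac{1}{2}(x - y)^\top \nabla^2 f(g(x, y))(x - y).$$

When $\nabla^2 f(z)$ is positive definite, the right-hand side of Eq. (26.2) can be written as
a squared norm, so that $D_f(x, y) = \frac{1}{2}\|x - y\|^2_{\nabla^2 f(z)}$.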

THEOREM 26.13. Let $\eta > 0$ and $f$ be Legendre and twice differentiable with
positive definite Hessian in $A = \operatorname{int}(\operatorname{dom}(f))$. Then for all $x, y \in A$ there exists a
$z \in [x, y] = \{(1 - \alpha)x + \alpha y : \alpha \in [0, 1]\}$ such that, for all $u \in \mathbb{R}^d$,

$$\langle x - y, u\rangle - \frac{D_f(x, y)}{\eta} \le \frac{\eta}{2}\|u\|^2_{\nabla^2 f(z)^{-1}}.$$


Although the Bregman divergence is not symmetric, the right-hand side does
not depend on the order of $x$ and $y$ in the Bregman divergence except that
the $z$ may be different.

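Proof Let $z \in [x, y]$ be a point such that $D_f(x, y) = \frac{1}{2}\|x - y\|^2_{\nabla^2 f(z)}$, which
exists by Theorem 26.12. By assumption $H = \nabla^2 f(z)$ is invertible. Applying
Cauchy-Schwarz,

$$\langle x - y, u\rangle \le \|x - y\|_H \|u\|_{H^{-1}} = \|u\|_{H^{-1}}\sqrt{2D_f(x, y)}.$$

Therefore,

$$\langle x - y, u\rangle - \frac{D_f(x, y)}{\eta} \le \|u\|_{H^{-1}}\sqrt{2D_f(x, y)} - \frac{D_f(x, y)}{\eta} \le \frac{\eta}{2}\|u\|^2_{H^{-1}},$$

where the last step follows from the ever useful $\max_{s \in \mathbb{R}} as - bs^2 = a^2/(4b)$, which
holds for any $b > 0$ and $a \in \mathbb{R}$ (the maximiser is $s = a/(2b)$), applied with
$s = \sqrt{D_f(x, y)}$, $a = \sqrt{2}\|u\|_{H^{-1}}$ and $b = 1/\eta$.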

Optimisation 


The first-order optimality condition states that if $x \in \mathbb{R}^d$ is the minimiser
of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, then $\nabla f(x) = 0$. One of the things we
like about convex functions is that when $f$ is convex, the first-order optimality
condition is both necessary and sufficient. In particular, if $\nabla f(x) = 0$ for some
$x \in \mathbb{R}^d$ then $x$ is a minimiser of $f$. The first-order optimality condition can also
be generalised to constrained minima: if $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is convex and differentiable
and $A \subseteq \mathbb{R}^d$ is a non-empty convex set, then

$$x^* \in \operatorname{argmin}_{x \in A} f(x) \iff \forall x \in A : \langle x - x^*, \nabla f(x^*)\rangle \ge 0. \tag{26.3}$$


The necessity of the condition on the right-hand side is easy to understand by
geometric reasoning. If $\nabla f(x^*) = 0$, then the said condition trivially holds. If
$\nabla f(x^*) \ne 0$, the hyperplane $H_{x^*}$ whose normal is $\nabla f(x^*)$ and goes through $x^*$
must be a supporting hyperplane of $A$ at $x^*$, with $-\nabla f(x^*)$ being the outer
normal of $A$ at $x^*$; otherwise $x^*$ could be moved by a small amount while staying
inside $A$ and improving the value of $f$. Since $A$ is convex, it thus lies entirely
on the side of $H_{x^*}$ that $\nabla f(x^*)$ points into. This is clearly equivalent to (26.3).
The sufficiency of the condition also follows from this geometric viewpoint as the
reader may verify from the figure.


Figure 26.4 Illustration of first-order optimality conditions. The point at the top is not a
minimiser because the hyperplane with normal as gradient does not support the convex
set. The point at the right is a minimiser.
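
The constrained condition (26.3) can be tested on a simple instance. The sketch below assumes $f(x) = \frac{1}{2}\|x - p\|_2^2$ and $A = [0, 1]^2$, for which the constrained minimiser is obtained by clipping $p$ to the box, and checks the inequality $\langle x - x^*, \nabla f(x^*)\rangle \ge 0$ over a grid of points of $A$. The point $p$, the grid and the function names are hypothetical.

```python
def gradient(x, p):
    """Gradient of f(x) = 0.5 * ||x - p||^2."""
    return [xi - pi for xi, pi in zip(x, p)]


def satisfies_first_order(candidate, p, points, tol=1e-12):
    """Check <x - candidate, grad f(candidate)> >= 0 for all x in `points`."""
    g = gradient(candidate, p)
    return all(
        sum((xi - ci) * gi for xi, ci, gi in zip(x, candidate, g)) >= -tol
        for x in points
    )


p = (1.5, -0.3)                                       # lies outside the box
box_grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]

minimiser = (1.0, 0.0)                                # p clipped to [0, 1]^2
not_minimiser = (0.5, 0.5)

print(satisfies_first_order(minimiser, p, box_grid))      # condition holds
print(satisfies_first_order(not_minimiser, p, box_grid))  # condition fails
```

At the interior point $(0.5, 0.5)$ the gradient is non-zero, so moving a little against it stays inside the box and decreases $f$, which is exactly the failure illustrated by the point at the top of the figure.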


The above statement continues to hold with a small modification even when $f$
is not differentiable everywhere. In particular, in this case the equivalence (26.3)
holds for any $x^* \in \operatorname{dom}(\nabla f)$ with the modification that on the right side of the
equivalence, $A$ should be replaced by $A \cap \operatorname{dom}(f)$:


PROPOSITION 26.14. Let $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ be a convex function and $A \subset \mathbb{R}^d$ a
non-empty convex set. Then, for any $x^* \in \operatorname{dom}(\nabla f)$, it holds that

$$x^* \in \operatorname{argmin}_{x \in A} f(x) \iff \forall x \in A \cap \operatorname{dom}(f) : \langle x - x^*, \nabla f(x^*)\rangle \ge 0. \tag{26.4}$$


Further, if $f$ is Legendre, then $x^* \in \operatorname{argmin}_{x \in A} f(x)$ implies $x^* \in \operatorname{dom}(\nabla f)$ and
hence also (26.4).


The part that concerns the Legendre objective $f$ follows by combining two facts:
$x^* \in \operatorname{int}(\operatorname{dom}(f))$ by Corollary 26.8, and $\operatorname{int}(\operatorname{dom}(f)) = \operatorname{dom}(\nabla f)$ by
Theorem 26.6(a).


Projections 


If $A \subset \mathbb{R}^d$ and $x \in \mathbb{R}^d$, then the Euclidean projection of $x$ on $A$ is $\Pi_A(x) =
\operatorname{argmin}_{y \in A} \|x - y\|_2^2$. One can also project with respect to a Bregman divergence
induced by convex function $f$. Let $\Pi_{A,f}$ be defined by

$$\Pi_{A,f}(x) = \operatorname{argmin}_{y \in A} D_f(y, x).$$


An important property of the projection is that minimising a Legendre function
$f$ on a convex set $A$ is (usually) equivalent to finding the unconstrained minimum
on the domain of $f$ and then projecting that point on to $A$.


THEOREM 26.15. Let $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ be Legendre, $A \subset \mathbb{R}^d$ a non-empty, closed
convex set with $A \cap \operatorname{dom}(f)$ non-empty and assume that $\tilde y = \operatorname{argmin}_{z \in \mathbb{R}^d} f(z)$
exists. Then the following hold:

(a) $\hat y = \operatorname{argmin}_{z \in A} f(z)$ exists and is unique;
(b) $\hat y = \operatorname{argmin}_{z \in A} D_f(z, \tilde y)$.


The assumption that $\tilde y$ exists is necessary. For example $f(x) = -\sqrt{x}$ for $x \ge 0$
and $f(x) = \infty$ for $x < 0$ is Legendre with domain $\operatorname{dom}(f) = [0, \infty)$, but $f$ does
not have a minimum on its domain.
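
Theorem 26.15 can be illustrated with the unnormalised negentropy and $A$ the probability simplex in $\mathbb{R}^3$. The unconstrained minimiser is $\tilde y = (1, 1, 1)$ because $\nabla f(x) = \log(x)$ vanishes there. The sketch below replaces the simplex by a finite grid (an approximation made here for simplicity) and finds the minimiser of $f$ and of $D_f(\cdot, \tilde y)$ by exhaustive search; both turn out to be the uniform distribution. All names are illustrative.

```python
import math


def negentropy(x):
    # Convention 0 * log(0) = 0 handles grid points on the boundary.
    return sum((xi * math.log(xi) - xi) if xi > 0 else 0.0 for xi in x)


def bregman_negentropy(z, y):
    """D_f(z, y) for the unnormalised negentropy, with y > 0 componentwise."""
    inner = sum(math.log(yi) * (zi - yi) for zi, yi in zip(z, y))
    return negentropy(z) - negentropy(y) - inner


# Grid on the simplex with resolution 1/30, which contains the uniform point.
m = 30
simplex_grid = [
    (i / m, j / m, (m - i - j) / m)
    for i in range(m + 1)
    for j in range(m + 1 - i)
]

y_tilde = (1.0, 1.0, 1.0)        # unconstrained minimiser of f

argmin_f = min(simplex_grid, key=negentropy)
argmin_d = min(simplex_grid, key=lambda z: bregman_negentropy(z, y_tilde))

print(argmin_f, argmin_d)
```

The agreement is no accident: since $\nabla f(\tilde y) = 0$, the divergence reduces to $D_f(z, \tilde y) = f(z) - f(\tilde y)$, so minimising either objective over $A$ is the same problem.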


Notes 


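1 The ‘infinity arithmetic’ on the extended real line is as follows:

$$\begin{aligned}
\alpha + \infty &= \infty && \text{for } \alpha \in (-\infty, \infty] \\
\alpha - \infty &= -\infty && \text{for } \alpha \in [-\infty, \infty) \\
\alpha \cdot \infty &= \infty \text{ and } \alpha \cdot (-\infty) = -\infty && \text{for } \alpha > 0 \\
\alpha \cdot \infty &= -\infty \text{ and } \alpha \cdot (-\infty) = \infty && \text{for } \alpha < 0 \\
0 \cdot \infty &= 0 \cdot (-\infty) = 0.
\end{aligned}$$

Like $\alpha/0$, the value of $\infty - \infty$ is not defined. We also have $\alpha \le \infty$ for all $\alpha$
and $\alpha \ge -\infty$ for all $\alpha$.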
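
Readers implementing these conventions should be aware that floating-point arithmetic agrees with most of them but not all. In Python, `0.0 * math.inf` evaluates to NaN rather than 0, so the convention $0 \cdot \infty = 0$ has to be imposed explicitly. The helper `ext_mul` below is a hypothetical name introduced for this sketch.

```python
import math


def ext_mul(a, b):
    """Multiplication on the extended reals with the convention 0 * inf = 0."""
    if a == 0 or b == 0:
        return 0.0
    return a * b


# Conventions that floating-point arithmetic already follows.
print(1.0 + math.inf, 1.0 - math.inf, -2.0 * math.inf)

# inf - inf is undefined in the text and is NaN in floating point.
print(math.isnan(math.inf - math.inf))

# 0 * inf: NaN in floating point, 0 under the convention of the text.
print(math.isnan(0.0 * math.inf), ext_mul(0.0, math.inf))
```

This matters, for example, when evaluating $x\log(x)$ at $x = 0$, where the convention $0\log(0) = 0$ must likewise be handled as a special case.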


2 There are many ways to define the topological notions used in this chapter.
The most elegant is also the most abstract, but there is no space for that here.
Instead we give the classical definitions that are specific to $\mathbb{R}^d$ and subsets.
Let $A$ be a subset of $\mathbb{R}^d$. A point $x \in A$ is an interior point if there exists
an $\varepsilon > 0$ such that $B_\varepsilon(x) = \{y : \|x - y\|_2 < \varepsilon\} \subset A$. The interior of $A$ is
$\operatorname{int}(A) = \{x \in A : x \text{ is an interior point}\}$. The set $A$ is open if $\operatorname{int}(A) = A$
and closed if its complement $A^c = \mathbb{R}^d \setminus A$ is open. The boundary of $A$ is
denoted by $\partial A$ and is the set of points $x \in \mathbb{R}^d$ such that for all $\varepsilon > 0$ the
set $B_\varepsilon(x)$ contains points from $A$ and $A^c$. Note that points in the boundary
need not be in $A$. Some examples: $\partial[0, \infty) = \{0\} = \partial(0, \infty)$ and $\partial\mathbb{R}^d = \emptyset$.


Bibliographic Remarks 

The main source for these notes is the excellent book by Rockafellar [2015]. The 
basic definitions are in part I. The Fenchel dual is analysed in part III while 
Legendre functions are found in part V. Convex optimisation is a huge topic. The 
standard text is by Boyd and Vandenberghe [2004]. 


Exercises 


26.1 Let $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ be convex. Prove that $f$ is continuous on $\operatorname{int}(\operatorname{dom}(f))$.


26.2 Prove Jensen’s inequality (Theorem 26.2). Precisely, let $X \in \mathbb{R}^d$ be a
random variable for which $\mathbb{E}[X]$ exists and $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ a measurable convex
function. Prove that $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$.


HINT Let $x_0 = \mathbb{E}[X] \in \mathbb{R}^d$ and define a linear function $g : \mathbb{R}^d \to \mathbb{R}$ such that
$g(x_0) = f(x_0)$ and $g(x) \le f(x)$ for all $x \in \mathbb{R}^d$. To guarantee the existence of
$g$, you may use the supporting hyperplane theorem, which states that if
$S \subset \mathbb{R}^d$ is a convex set and $s \in \partial S$, then there exists a supporting hyperplane
containing $s$.


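26.3 Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{-\infty, \infty\}$.


(a) Prove that $f^{**}(x) \leq f(x)$.
(b) Assume that $f$ is convex and differentiable on $\operatorname{int}(\operatorname{dom}(f))$. Show that
$f^{**}(x) = f(x)$ for $x \in \operatorname{int}(\operatorname{dom}(f))$.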


As mentioned in the text, the assumption that f is differentiable can be 
relaxed to an assumption that the epigraph of f is closed, in which case the 
result holds over the whole domain. The proof is not hard, but you will need 
to use the sub-differential rather than the gradient, and the boundary must 
be treated with care. 


26.4 For each of the real-valued functions below, decide whether or not it is
Legendre on the given domain:


(a) $f(x) = x^2$ on $[-1, 1]$.
(b) $f(x) = -\sqrt{x}$ on $[0, \infty)$.
(c) $f(x) = \log(1/x)$ on $[0, \infty)$ with $f(0) = \infty$.
(d) $f(x) = x \log(x)$ on $[0, \infty)$ with $f(0) = 0$.
(e) $f(x) = |x|$ on $\mathbb{R}$.
(f) $f(x) = \max\{|x|, x^2\}$ on $\mathbb{R}$.

26.5 Prove Theorem 26.3. 

26.6 Prove Proposition 26.7 and Corollary 26.8. 

26.7 Prove Proposition 26.14. 

26.8 Let f be the convex function given in Example 26.10. 


(a) For $x, y \in \operatorname{dom}(f)$, find $D_f(x, y)$.
(b) Compute $f^*(u)$ and $\nabla f^*(u)$.
(c) Find $\operatorname{dom}(\nabla f^*)$.
(d) Show that for $u, v \in (-\infty, 0]^d$,
(e) Verify the claims in Theorem 26.6.


26.9 Let $f : \mathbb{R}^d \to \mathbb{R}$ be the unnormalised negentropy function from
Example 26.11. We have seen in Example 26.5 that
$D_f(x, y) = \sum_{i=1}^d (x_i \log(x_i / y_i) + y_i - x_i)$.


(a) Compute $f^*(u)$ and $\nabla f^*(u)$.
(b) Find $\operatorname{dom}(\nabla f^*)$.
(c) Show that for $u, v \in \mathbb{R}^d$,

$$D_{f^*}(u, v) = \sum_{i=1}^d \exp(v_i)(v_i - u_i) + \exp(u_i) - \exp(v_i) .$$

(d) Verify the claims in Theorem 26.6.


26.10 Let $f$ be Legendre. Show that $\tilde f$ given by $\tilde f(x) = f(x) + \langle x, u \rangle$ is also
Legendre for any $u \in \mathbb{R}^d$.


26.11 Let f be the unnormalised negentropy function from Example 26.5. 


(a) Prove that $f$ is Legendre.
(b) Given $y \in (0, \infty)^d$, prove that $\operatorname{argmin}_{x \in \mathcal{P}_{d-1}} D_f(x, y) = y / \|y\|_1$.


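26.12 Let $\alpha \in [0, 1/d]$ and $A = \mathcal{P}_{d-1} \cap [\alpha, 1]^d$ and $f$ be the unnormalised
negentropy function. Let $y \in [0, \infty)^d$ and $x = \operatorname{argmin}_{x \in A} D_f(x, y)$ and assume
that $y_1 \leq y_2 \leq \cdots \leq y_d$. Let $m$ be the smallest value such that

$$y_m (1 - (m - 1)\alpha) > \alpha \sum_{j=m}^d y_j .$$

Show that

$$x_i = \begin{cases} \alpha & \text{if } i < m \\ (1 - (m - 1)\alpha) \, y_i / \sum_{j=m}^d y_j & \text{otherwise} . \end{cases}$$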


26.13 (GENERALISED PYTHAGOREAN IDENTITY) Let $A \subset \mathbb{R}^d$ be convex and
closed and $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function with $A \cap \operatorname{dom}(f)$ non-empty.


(a) Suppose that $x \in A$ and $y \in \mathbb{R}^d$ and $z = \Pi_{A,f}(y)$ and $f$ is differentiable at
$y$. Prove that

$$D_f(x, y) \geq D_f(x, z) + D_f(z, y) .$$

(b) Prove that the condition that $f$ be differentiable at $y$ cannot be relaxed.


26.14 Prove Theorem 26.3 and show that Part (c) does not hold in general 
when f is not differentiable at y. 


26.15 Prove Theorem 26.12 


HINT For the first part, simply apply Taylor’s theorem. For the second part,
use a measurable selection theorem. For example, the theorem by Kuratowski
and Ryll-Nardzewski, which appears as Theorem 6.9.4 in the second volume of
the book by Bogachev [2007].


27 Exp3 for Adversarial Linear Bandits


The model for adversarial linear bandits is as follows. The learner is given an
action set $\mathcal{A} \subset \mathbb{R}^d$ and the number of rounds $n$. As usual in the adversarial setting,
it is convenient to switch to losses. An instance of the adversarial problem is a
sequence of loss vectors $y_1, \ldots, y_n$ taking values in $\mathbb{R}^d$. In each round $t \in [n]$, the
learner selects a possibly random action $A_t \in \mathcal{A}$ and observes a loss $Y_t = \langle A_t, y_t \rangle$.
The learner does not observe the loss vector $y_t$. The regret of the learner after $n$
rounds is

$$R_n = \mathbb{E}\left[\sum_{t=1}^n Y_t\right] - \min_{a \in \mathcal{A}} \sum_{t=1}^n \langle a, y_t \rangle .$$

Clearly, the finite-armed adversarial bandit problem discussed in Chapter 11 is a
special case of adversarial linear bandits corresponding to the choice
$\mathcal{A} = \{e_1, \ldots, e_d\}$, where $e_1, \ldots, e_d$ are the unit vectors of the $d$-dimensional
standard Euclidean basis.


For this chapter, we assume that


(a) for all $t \in [n]$ the loss satisfies $y_t \in \mathcal{L} = \{x \in \mathbb{R}^d : \sup_{a \in \mathcal{A}} |\langle a, x \rangle| \leq 1\}$;
and
(b) the action set $\mathcal{A}$ spans $\mathbb{R}^d$.


The latter assumption is for convenience only and may be relaxed with a
little care (Exercise 27.7).


Exponential Weights for Linear Bandits 


We adapt the exponential-weighting algorithm of Chapter 11. As in that setting,
we need a way to estimate the individual losses for each action, but now we make
use of the linear structure to share information between the arms and decrease
the variance of our estimators. For now we assume that $\mathcal{A}$ is finite, which we
relax in Section 27.3. Let $t \in [n]$ be the index of the current round. If the loss
estimate for action $a \in \mathcal{A}$ in round $s \in [n]$ is $\hat Y_s(a)$, then the probability
distribution proposed by exponential weights is given by the probability mass
function $\tilde P_t : \mathcal{A} \to [0, 1]$ with

$$\tilde P_t(a) \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\right) ,$$

where $\eta > 0$ is the learning rate. To control the variance of the loss estimates,
it will be useful to mix this distribution with an exploration distribution $\pi$
($\pi : \mathcal{A} \to [0, 1]$ and $\sum_{a \in \mathcal{A}} \pi(a) = 1$). The mixture distribution is

$$P_t(a) = (1 - \gamma) \tilde P_t(a) + \gamma \pi(a) ,$$

where $\gamma$ is a constant mixing factor to be chosen later. The algorithm then simply
samples its action $A_t$ from $P_t$:

$$A_t \sim P_t .$$

Recall that $Y_t = \langle A_t, y_t \rangle$ is the observed loss after taking action $A_t$. We need a
way to estimate $y_t(a) = \langle a, y_t \rangle$. The idea is to use least squares to estimate $y_t$ with
$\hat Y_t = R_t A_t Y_t$, where $R_t \in \mathbb{R}^{d \times d}$ is selected so that $\hat Y_t$ is an unbiased estimate of $y_t$
given the history. Then the loss for a given action is estimated by $\hat Y_t(a) = \langle a, \hat Y_t \rangle$.
To find the choice of $R_t$ that makes $\hat Y_t$ unbiased, let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid P_t]$ and calculate

$$\mathbb{E}_t[\hat Y_t] = R_t \, \mathbb{E}_t[A_t A_t^\top] \, y_t = R_t \underbrace{\left(\sum_{a \in \mathcal{A}} P_t(a) a a^\top\right)}_{Q_t} y_t .$$

Using $R_t = Q_t^{-1}$ leads to $\mathbb{E}_t[\hat Y_t] = y_t$ as desired. Of course $Q_t$ should be
non-singular, which will follow by choosing $\pi$ so that

$$Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a) a a^\top$$

is non-singular. The complete algorithm is summarised in Algorithm 15.
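The unbiasedness calculation can be checked numerically. The sketch below is an illustration only: the action set, the distribution `probs` (standing in for $P_t$) and the loss vector `y` are made-up values for $d = 2$, and the helper functions are hypothetical names. It computes the exact conditional expectation $\sum_a P_t(a) Q_t^{-1} a \langle a, y_t \rangle$ and compares it with $y_t$.

```python
def outer_sum(probs, actions):
    """Q = sum_a p(a) a a^T for 2-dimensional actions, as nested lists."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for p, a in zip(probs, actions):
        for i in range(2):
            for j in range(2):
                q[i][j] += p * a[i] * a[j]
    return q

def inv2(m):
    """Inverse of a non-singular 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(2)) for i in range(2)]

actions = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
probs = [0.5, 0.3, 0.2]   # plays the role of P_t; Q_t is non-singular here
y = (0.3, -0.4)           # hidden loss vector y_t (never seen by the learner)

q_inv = inv2(outer_sum(probs, actions))

# Exact conditional expectation of the estimator Q_t^{-1} A_t Y_t.
expected = [0.0, 0.0]
for p, a in zip(probs, actions):
    loss = a[0] * y[0] + a[1] * y[1]            # Y_t when A_t = a
    est = matvec(q_inv, [a[0] * loss, a[1] * loss])
    expected = [expected[i] + p * est[i] for i in range(2)]
```

Up to floating-point error, `expected` coincides with `y`, as the identity $Q_t^{-1} Q_t y_t = y_t$ predicts.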


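1: Input: Finite action set $\mathcal{A} \subset \mathbb{R}^d$, learning rate $\eta$, exploration distribution $\pi$,
   exploration parameter $\gamma$

2: for $t = 1, 2, \ldots, n$ do

3:   Compute sampling distribution:

     $$P_t(a) = \gamma \pi(a) + (1 - \gamma) \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\right)}{\sum_{a' \in \mathcal{A}} \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_s(a')\right)}$$

4:   Sample action $A_t \sim P_t$
5:   Observe loss $Y_t = \langle A_t, y_t \rangle$ and compute loss estimates:

     $$\hat Y_t = Q_t^{-1} A_t Y_t \quad \text{and} \quad \hat Y_t(a) = \langle a, \hat Y_t \rangle .$$

6: end for


Algorithm 15: Exp3 for linear bandits.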
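For readers who prefer code, here is a minimal pure-Python sketch of Algorithm 15. It is an illustration under stated assumptions, not a reference implementation: the function names, the toy action set, the fixed loss sequence and the seed are all invented for the example, and the linear system is solved naively at every round.

```python
import math
import random

def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    d = len(b)
    aug = [list(m[i]) + [b[i]] for i in range(d)]
    for c in range(d):
        piv = max(range(c, d), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(d):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * z for x, z in zip(aug[r], aug[c])]
    return [aug[i][d] / aug[i][i] for i in range(d)]

def dot(u, v):
    return sum(x * z for x, z in zip(u, v))

def exp3_linear(actions, losses, eta, gamma, pi, rng):
    """Run Exp3 for linear bandits and return the total loss suffered."""
    d = len(actions[0])
    cum_est = [0.0] * len(actions)      # sum over s of hat{Y}_s(a), per action
    total = 0.0
    for y in losses:
        # Step 3: exponential weights mixed with the exploration distribution.
        low = min(cum_est)              # constant shift for numerical stability
        w = [math.exp(-eta * (c - low)) for c in cum_est]
        z = sum(w)
        p = [gamma * pa + (1 - gamma) * wa / z for pa, wa in zip(pi, w)]
        # Step 4: sample A_t from P_t.
        a_t = actions[rng.choices(range(len(actions)), weights=p)[0]]
        # Step 5: only the scalar loss Y_t = <A_t, y_t> is observed.
        loss = dot(a_t, y)
        total += loss
        # Q_t = sum_a P_t(a) a a^T and hat{Y}_t = Q_t^{-1} A_t Y_t.
        q = [[sum(pa * a[i] * a[j] for pa, a in zip(p, actions))
              for j in range(d)] for i in range(d)]
        y_hat = solve(q, [x * loss for x in a_t])
        for k, a in enumerate(actions):
            cum_est[k] += dot(a, y_hat)
    return total

# Toy instance: four actions in R^2, uniform exploration, a fixed loss vector.
actions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
pi = [0.25] * 4
n = 2000
losses = [(0.5, 0.1)] * n               # satisfies |<a, y_t>| <= 1 for all a
g = 2.0                                 # Q(pi) = I/2 here, so max_a a^T Q(pi)^{-1} a = 2
eta = math.sqrt(math.log(4) / ((2 * g + 2) * n))
gamma = eta * g
total = exp3_linear(actions, losses, eta, gamma, pi, random.Random(0))
best = min(sum(dot(a, y) for y in losses) for a in actions)
regret = total - best                   # realised regret of a single run
bound = 2 * math.sqrt((2 * g + 2) * n * math.log(4))
```

The value `regret` is the realised regret of one run, whereas the bound discussed in the text controls the expectation, so the two need not be ordered on every run. Building $Q_t$ from scratch costs $O(kd^2)$ per round in this sketch; a serious implementation would organise the computation differently.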
Regret Analysis


THEOREM 27.1. Assume that $\mathcal{A}$ is non-empty and let $k = |\mathcal{A}|$. For any exploration
distribution $\pi$, for some parameters $\eta$ and $\gamma$, for all $(y_t)_t$ with $y_t \in \mathcal{L}$, the regret
of Algorithm 15 satisfies

$$R_n \leq 2\sqrt{(2g(\pi) + d) n \log(k)} , \tag{27.1}$$

where $g(\pi) = \max_{a \in \mathcal{A}} \|a\|^2_{Q^{-1}(\pi)}$. Furthermore, there exists an exploration
distribution $\pi$ and parameters $\eta$ and $\gamma$ such that $g(\pi) \leq d$, and hence
$R_n \leq 2\sqrt{3dn \log(k)}$.


The utility of (27.1) is that at times, calculating the distribution that minimises
$g(\pi)$ or sampling from it may be difficult, in which case, one may employ a
distribution that trades off computation with the regret.


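Proof Assume that the learning rate $\eta$ is chosen so that for each round $t$ the
loss estimates satisfy

$$\eta \hat Y_t(a) \geq -1 , \qquad \forall a \in \mathcal{A} . \tag{27.2}$$

Then, by adapting the proof of Theorem 11.1 (see Exercise 27.1), the regret is
bounded by

$$R_n \leq \frac{\log k}{\eta} + 2\gamma n + \eta \sum_{t=1}^n \mathbb{E}\left[\sum_{a \in \mathcal{A}} P_t(a) \hat Y_t^2(a)\right] . \tag{27.3}$$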


Note that we cannot use the proof that leads to the tighter constant ($\eta$ getting
replaced by $\eta/2$ in the second term above) because we would lose too much in
other parts of the proof by guaranteeing that the loss estimates are bounded
by one (see below). To get a regret bound, it remains to set $\gamma$ and $\eta$ so that
(27.2) is satisfied and to bound $\mathbb{E}\left[\sum_a P_t(a) \hat Y_t^2(a)\right]$. We start with the latter. Let
$M_t = \sum_a P_t(a) \hat Y_t^2(a)$. By the definition of the loss estimate,

$$\hat Y_t^2(a) = (a^\top Q_t^{-1} A_t Y_t)^2 = Y_t^2 A_t^\top Q_t^{-1} a a^\top Q_t^{-1} A_t ,$$

which means that $M_t = \sum_a P_t(a) \hat Y_t^2(a) = Y_t^2 A_t^\top Q_t^{-1} A_t \leq A_t^\top Q_t^{-1} A_t =
\operatorname{trace}(A_t A_t^\top Q_t^{-1})$, and by the linearity of trace,

$$\mathbb{E}[M_t \mid P_t] \leq \operatorname{trace}\left(\sum_{a \in \mathcal{A}} P_t(a) a a^\top Q_t^{-1}\right) = d .$$


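It remains to choose $\gamma$ and $\eta$. Strengthen (27.2) to $|\eta \hat Y_t(a)| \leq 1$ and note that
since $|Y_t| \leq 1$,

$$|\eta \hat Y_t(a)| = |\eta a^\top Q_t^{-1} A_t Y_t| \leq \eta |a^\top Q_t^{-1} A_t| .$$

Recall that $Q(\pi) = \sum_{a \in \mathcal{A}} \pi(a) a a^\top$. Clearly $Q_t \succeq \gamma Q(\pi)$, and hence
$Q_t^{-1} \preceq Q(\pi)^{-1} / \gamma$ by Exercise 27.4. Using this and the Cauchy-Schwarz inequality shows
that

$$|a^\top Q_t^{-1} A_t| \leq \|a\|_{Q_t^{-1}} \|A_t\|_{Q_t^{-1}} \leq \max_{\nu \in \mathcal{A}} \nu^\top Q_t^{-1} \nu \leq \frac{1}{\gamma} \max_{\nu \in \mathcal{A}} \nu^\top Q^{-1}(\pi) \nu = \frac{g(\pi)}{\gamma} ,$$

which implies that

$$|\eta \hat Y_t(a)| \leq \frac{\eta g(\pi)}{\gamma} . \tag{27.4}$$

Choosing $\gamma = \eta g(\pi)$ guarantees $|\eta \hat Y_t(a)| \leq 1$. Plugging this choice into (27.3), we
get

$$R_n \leq \frac{\log k}{\eta} + \eta n (2g(\pi) + d) = 2\sqrt{(2g(\pi) + d) n \log(k)} ,$$

where the last equality is derived by choosing $\eta = \sqrt{\log(k) / ((2g(\pi) + d) n)}$, finishing the proof
of (27.1).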

For the second half, recall that by the Kiefer-Wolfowitz theorem (Theorem 21.1
and Exercise 21.6), there exists a sampling distribution $\pi$ such that $g(\pi) \leq d$.
Plugging this value into (27.1) finishes the proof.
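The tuning in the proof is easy to turn into a small calculator. The sketch below is only an illustration: the function name `exp3_linear_tuning` and the example values are assumptions, and the formulas are exactly those appearing in the proof ($\gamma = \eta g(\pi)$ and the value of $\eta$ that balances the two terms).

```python
import math

def exp3_linear_tuning(n: int, k: int, d: int, g: float):
    """Return (eta, gamma, bound) as chosen in the proof of Theorem 27.1.

    n: horizon, k: number of actions, d: dimension,
    g: the quantity g(pi) = max_a ||a||^2_{Q(pi)^{-1}}.
    """
    eta = math.sqrt(math.log(k) / ((2 * g + d) * n))
    gamma = eta * g
    bound = 2 * math.sqrt((2 * g + d) * n * math.log(k))
    return eta, gamma, bound

# With a Kiefer-Wolfowitz exploration distribution one may take g = d.
eta, gamma, bound = exp3_linear_tuning(n=10000, k=50, d=5, g=5.0)
```

The mixture $P_t$ is only a probability distribution when $\gamma \leq 1$, which holds once $n$ is large enough relative to $g(\pi)$ and $\log(k)$; the sketch does not check this.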


Continuous Exponential Weights 


The dependence on $\log(k)$ in the regret guarantee provided by Theorem 27.1
is objectionable when the number of arms is extremely large or infinite. One
approach is to find a finite subset $\mathcal{C} \subset \mathcal{A}$ for which

$$\sup_{a \in \mathcal{A}} \min_{b \in \mathcal{C}} \sup_{y \in \mathcal{L}} |\langle a - b, y \rangle| \leq 1/n .$$

A standard calculation (Exercise 27.6) shows that $\mathcal{C}$ can always be chosen
so that $\log |\mathcal{C}| \leq d \log(6dn)$. Then it is easy to check that Exp3 on $\mathcal{C}$ suffers regret
relative to the best action in $\mathcal{A}$ of at most $R_n = O(d\sqrt{n \log(nd)})$. The problem
with this approach is that $\mathcal{C}$ is exponentially large in $d$, which makes this algorithm
intractable in most situations. When $\mathcal{A}$ is convex, a more computationally
tractable approach is to use the continuous exponential weights algorithm.


For this section, we assume that A is convex and has positive Lebesgue 
measure. The latter condition can be relaxed with some care (Exercise 27.10). 


Let $\pi$ be a probability measure supported on $\mathcal{A}$. The continuous exponential
weights policy samples $A_t$ from $P_t = (1 - \gamma) \tilde P_t + \gamma \pi$, where $\tilde P_t$ is a measure
supported on $\mathcal{A}$ defined by

$$\tilde P_t(B) = \frac{\int_B \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\right) da}{\int_{\mathcal{A}} \exp\left(-\eta \sum_{s=1}^{t-1} \hat Y_s(a)\right) da} . \tag{27.5}$$

We will shortly see that the analysis in the previous section can be copied almost
verbatim to prove a regret bound for this strategy. But what has been bought here?
Rather than sampling from a discrete distribution on a large number of arms, we 
now have to sample from a probability measure on a convex set. Sampling from 
arbitrary probability measures is itself a challenging problem, but under certain 
conditions there are polynomial time algorithms for this problem. The factors 
that play the biggest role in the feasibility of sampling from a measure are (a) 
the form of the measure or its density and (b) how the convex set is represented. 
As it happens, the measure defined in the last display is log-concave, which 
means that the logarithm of the density, with respect to the Lebesgue measure 
on A, is a concave function. 


THEOREM 27.2. Let $p(a) \propto \mathbb{I}_{\mathcal{A}}(a) \exp(-f(a))$ be a density with respect to the
Lebesgue measure on $\mathcal{A}$ such that $f : \mathcal{A} \to \mathbb{R}$ is a convex function. Then there
exists a polynomial-time algorithm for sampling from $p$, provided one can compute
the following efficiently:


1 (First-order information): $\nabla f(a)$ where $a \in \mathcal{A}$.
2 (Euclidean projections): $\operatorname{argmin}_{x \in \mathcal{A}} \|x - y\|_2$ where $y \in \mathbb{R}^d$.


The probability distribution defined by Eq. (27.5) satisfies the first condition.
Efficiently computing a projection on to a convex set is a more delicate issue.
A general criterion that makes this efficient is access to a separation oracle,
which is a computational procedure $\phi$ that accepts a point $x \in \mathbb{R}^d$ as input and
responds $\phi(x) = \text{TRUE}$ if $x \in \mathcal{A}$ and otherwise $\phi(x) = u$, where $\langle y, u \rangle > \langle x, u \rangle$ for
all $y \in \mathcal{A}$ (see Fig. 27.1).


Figure 27.1 Separation oracle returns the normal of a hyperplane that separates $x$ from
$\mathcal{A}$ whenever $x \notin \mathcal{A}$. When $x \in \mathcal{A}$, the separation oracle returns TRUE.
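As a concrete illustration of the definition, the following sketch implements a separation oracle for one particular action set, the Euclidean unit ball. The choice of set and the function name `ball_oracle` are assumptions made for the example. For $x$ outside the ball, $u = -x$ is a valid response because $\langle y, -x \rangle \geq -\|x\|_2 > -\|x\|_2^2 = \langle x, -x \rangle$ for every $y$ in the ball.

```python
import math
from typing import List, Sequence, Union

def ball_oracle(x: Sequence[float]) -> Union[bool, List[float]]:
    """Separation oracle for the unit ball {a : ||a||_2 <= 1}.

    Returns True when x lies in the ball. Otherwise returns a vector u
    with <y, u> > <x, u> for every y in the ball.
    """
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= 1.0:
        return True
    return [-v for v in x]
```

For polytopes given by linear inequalities, an oracle of the same shape can return the (negated) normal of any violated constraint.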


Define $\log_+(x) = \max(0, \log(x))$.


THEOREM 27.3. Assume that $\mathcal{A}$ is compact, convex and has volume $\operatorname{vol}(\mathcal{A}) =
\int_{\mathcal{A}} da > 0$. Then an appropriately tuned instantiation of the continuous
exponential weights algorithm with Kiefer-Wolfowitz exploration has regret bounded
by

$$R_n \leq 2d\sqrt{3n(1 + \log_+(2n/d))} .$$


The proof of Theorem 27.3 relies on the following proposition, which we leave 
as an exercise (Exercise 27.11). 


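PROPOSITION 27.4. Let $K \subset \mathbb{R}^d$ be a compact convex set with $\operatorname{vol}(K) > 0$, $u \in \mathbb{R}^d$
and let $x^* = \operatorname{argmin}_{x \in K} \langle x, u \rangle$. Then,

$$\log\left(\frac{\operatorname{vol}(K)}{\int_K \exp(-\langle x - x^*, u \rangle) \, dx}\right) \leq 1 + \max\left(0, \, d \log\left(\frac{\sup_{x, y \in K} \langle x - y, u \rangle}{d}\right)\right) .$$

The left-hand side in the above display is the logarithmic Laplace transform
of the uniform measure on $K - \{x^*\}$ evaluated at $u$.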


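Proof of Theorem 27.3 As before, choosing $\gamma = d\eta$ ensures that $|\eta \langle a, \hat Y_t \rangle| \leq 1$ for
all $a \in \mathcal{A}$ (see the proof of Theorem 27.1). The standard argument (Exercise 27.9)
shows that

$$R_n \leq \frac{1}{\eta} \mathbb{E}\left[\log\left(\frac{\operatorname{vol}(\mathcal{A})}{\int_{\mathcal{A}} \exp\left(-\eta \sum_{t=1}^n (\hat Y_t(a) - \hat Y_t(a^*))\right) da}\right)\right] + 3\eta d n . \tag{27.6}$$

Using again that $\eta |\langle a, \hat Y_t \rangle| \leq 1$ and Proposition 27.4 with $u = \eta \sum_{t=1}^n \hat Y_t$ shows
that

$$R_n \leq \frac{d}{\eta}\left(1 + \log_+(2n/d)\right) + 3\eta d n \leq 2d\sqrt{3n(1 + \log_+(2n/d))} ,$$

where the last inequality follows by choosing $\eta = \sqrt{(1 + \log_+(2n/d)) / (3n)}$.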


Notes 


1 A naive implementation of Algorithm 15 has computation complexity $O(kd + d^3)$
per round. There is also the one-off cost of computing the exploration 
distribution, the complexity of which was discussed in Chapter 21. The real 
problem is that k can be extremely large. This is especially true when the 
action set is combinatorial. For example, when $\mathcal{A} = \{a \in \mathbb{R}^d : a_i = \pm 1\}$ is
the set of corners of the hypercube, $|\mathcal{A}| = 2^d$, which is much too large unless the
dimension is small. Such problems call for a different approach that we present 
in the next chapter and in Chapter 30. 

2 It is not important to find exactly the optimal exploration distribution. All
that is needed is a bound on Eq. (27.4), which for the exploration distribution
based on the Kiefer-Wolfowitz theorem is just $d$. However, unlike in the finite
case, exploration is crucial and cannot be removed (Exercise 27.8).

3 The $O(\sqrt{n})$ dependence of the regret on the horizon is not improvable, but the
linear dependence on the dimension is suboptimal for certain action sets and 
optimal for others. An example where improvement is possible occurs when A 
is the unit ball, which is analysed in the next chapter. 

4 A slight modification of the set-up allows the action set to change in each
round, but where actions have identities. Suppose that $k \in \{1, 2, \ldots\}$ and $\mathcal{A}_t =
\{a_1(t), \ldots, a_k(t)\}$ and the adversary chooses losses so that $\max_{a \in \mathcal{A}_t} |\langle a, y_t \rangle| \leq 1$
for all $t$. Then a straightforward adaptation of Algorithm 15 and Theorem 27.1
leads to an algorithm for which

$$R_n = \max_{i \in [k]} \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a_i(t), y_t \rangle\right] \leq 2\sqrt{3dn \log(k)} .$$


The definition of the regret still compares the learner to the best single action 
in hindsight, which makes it less meaningful than the definition of the regret 
in Chapter 19 for stochastic linear bandits with changing action sets. These 
differences are discussed in more detail in Chapter 29. See also Exercise 27.5. 


Bibliographic Remarks 


The results in Sections 27.1 and 27.2 follow the article by Bubeck et al. [2012], 
with minor modifications to make the argument more pedagogical. The main 
difference is that they used John’s ellipsoid over the action set for exploration, 
which is only the right thing when John’s ellipsoid is also a central ellipsoid. Here 
we use Kiefer-Wolfowitz, which is equivalent to finding the minimum volume 
central ellipsoid containing the action set. Theorem 27.2, which guarantees 
the existence of a polynomial time sampling algorithm for convex sets with 
gradient information and projections is by Bubeck et al. [2015b]. We warn the 
reader that these algorithms are not very practical, especially if theoretically 
justified parameters are used. The study of sampling from convex bodies is quite 
fascinating. There is an overview by Lovász and Vempala [2007], though it is a 
little old. The continuous exponential weights algorithm is perhaps attributable 
to Cover [1991] in the special setting of online learning called universal portfolio 
optimisation. The first application to linear bandits is by Hazan et al. [2016]. 
Their algorithm and analysis are more complicated because they seek to improve 
the computation properties by replacing the exploration distribution based on 
Kiefer-Wolfowitz with an adaptive randomised exploration basis that can be
computed in polynomial time under weaker assumptions. Continuous exponential 
weights for linear bandits using the core set of John’s ellipsoid for exploration 
(rather than Kiefer-Wolfowitz) was recently analysed by van der Hoeven et al. 
[2018]. Another path towards an efficient $O(d\sqrt{n \log(\cdot)})$ policy for convex action
sets is to use the tools from online optimisation. We explain some of these ideas 
in more detail in the next chapter, but the reader is referred to the paper by 
Bubeck and Eldan [2015]. 
Exercises


27.1 (‘MIXED’ EXP3 ANALYSIS) Prove Eq. (27.3). 


27.2 (DEPENDENCE ON THE RANGE OF LOSSES) Suppose that instead of assuming
$y_t \in \mathcal{L}$, we assume that $y_t \in \{y \in \mathbb{R}^d : \sup_{a \in \mathcal{A}} |\langle a, y \rangle| \leq b\}$ for some known $b > 0$.
Modify the algorithm to accommodate this change, and explain how the regret
guarantee changes.


27.3 (DEPENDENCE ON THE RANGE OF LOSSES (II)) Now suppose that $a < b$
are known and $y_t \in \{y \in \mathbb{R}^d : \langle a', y \rangle \in [a, b] \text{ for all } a' \in \mathcal{A}\}$. How can you adapt
the algorithm now, and what is its regret?


27.4 (INVERSION REVERSES LOEWNER ORDERS) Let $A, B \in \mathbb{R}^{d \times d}$ and suppose
that $A \succeq B$ and $B$ is invertible. Show that $A^{-1} \preceq B^{-1}$.


27.5 (CHANGING ACTION SETS) Provide the necessary corrections to 
Algorithm 15 and its analysis to prove the result claimed in Note 4. 


HINT You will need to choose a new exploration distribution in every round. 
Otherwise everything is more or less the same. 


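27.6 (COVERING NUMBERS FOR CONVEX SETS) For $K \subset \mathbb{R}^d$ let $\|x\|_K =
\sup_{y \in K} |\langle x, y \rangle|$. Let $\mathcal{A} \subset \mathbb{R}^d$ and $\mathcal{L} = \{y : \|y\|_{\mathcal{A}} \leq 1\}$. Let $N(\mathcal{A}, \varepsilon)$ be the size of
the smallest subset $\mathcal{C} \subseteq \mathcal{A}$ such that $\min_{x' \in \mathcal{C}} \|x - x'\|_{\mathcal{L}} \leq \varepsilon$ for all $x \in \mathcal{A}$. Show
the following:


(a) When $\mathcal{A} = \{x \in \mathbb{R}^d : \|x\|_{V^{-1}} \leq 1\}$, we have $N(\mathcal{A}, \varepsilon) \leq (3/\varepsilon)^d$.
(b) When $\mathcal{A}$ is convex, bounded and $\operatorname{span}(\mathcal{A}) = \mathbb{R}^d$ we have $N(\mathcal{A}, \varepsilon) \leq (3d/\varepsilon)^d$.
(c) For any bounded $\mathcal{A} \subset \mathbb{R}^d$ we have $N(\mathcal{A}, \varepsilon) \leq (6d/\varepsilon)^d$.


HINT For the first part, find a linear map from $\mathcal{A}$ to the Euclidean ball and
use the fact that the Euclidean ball can be covered with a set of size $(3/\varepsilon)^d$. For
the second part use the fact that for any symmetric, convex and compact set $K$
there exists an ellipsoid $E = \{x : \|x\|_V \leq 1\}$ such that $E \subseteq K \subseteq dE$.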


27.7 (LOW RANK ACTION SETS (I)) In the definition of the algorithm and the
proof of Theorem 27.1, we assumed that $\mathcal{A}$ spans $\mathbb{R}^d$ and that it has positive
Lebesgue measure. Show that this assumption may be relaxed by carefully
adapting the algorithm and analysis.


27.8 (NECESSITY OF EXPLORATION) We saw in Chapter 11 that the exponential
weights algorithm achieved near-optimal regret without mixing additional
exploration. Show that exploration is crucial here. More precisely, construct a
finite action set $\mathcal{A}$ and reward sequence $y_t \in \mathcal{L}$ such that the regret of Algorithm 15
with $\gamma = 0$ becomes very poor (even with $\eta$ optimally tuned) relative to the
optimal choice.


27.9 (CONTINUOUS EXPONENTIAL WEIGHTS) Complete the missing steps in the
proof of Theorem 27.3.


27.10 (LOW RANK ACTION SETS (II)) In the definition of the algorithm and
the proof of Theorem 27.3, we assumed that $\mathcal{A}$ spans $\mathbb{R}^d$ and that it has positive
Lebesgue measure. Show that this assumption may be relaxed by carefully
adapting the algorithm and analysis.


27.11 (VOLUME BOUNDS) Prove Proposition 27.4. 
28 Follow-the-Regularised-Leader and Mirror Descent


In the last chapter, we showed that if $\mathcal{A} \subset \mathbb{R}^d$ has $k$ elements, then
Exp3 with a careful exploration distribution has regret

$$R_n = O(\sqrt{dn \log(k)}) .$$

We also showed the continuous version of this algorithm has regret at most

$$R_n = O(d\sqrt{n \log(n)}) .$$

Although this algorithm can often be made to run in polynomial time, the degree
tends to be high and the implementation complicated, making the algorithm
impractical. In many cases this can be improved, both in terms of the regret and
computation. In this chapter we demonstrate this in the case when $\mathcal{A}$ is the unit
ball by showing that for this case there is an efficient, low-complexity algorithm
for which the regret is $R_n = O(\sqrt{dn \log(n)})$. More importantly, however, we
introduce a pair of related algorithms called follow-the-regularised-leader and
mirror descent, which are powerful tools for the design and analysis of bandit
algorithms. In fact, the exponential weights algorithm turns out to be a special case.


Online Linear Optimisation


Mirror descent originated in the convex optimisation literature. The idea has
since been adapted to online learning and specifically to online linear
optimisation. Online linear optimisation is the full information version of the
adversarial linear bandit, where at the end of each round the learner observes the
full vector $y_t$. Let $\mathcal{A} \subset \mathbb{R}^d$ be a convex set and $\mathcal{L} \subset \mathbb{R}^d$ be an arbitrary set
called the loss space. Let $y_1, \ldots, y_n$ be a sequence of loss vectors with $y_t \in \mathcal{L}$
for $t \in [n]$. In each round the learner chooses $a_t \in \mathcal{A}$ and subsequently observes
the vector $y_t$. The regret relative to a fixed comparator $a \in \mathcal{A}$ is

$$R_n(a) = \sum_{t=1}^n \langle a_t - a, y_t \rangle ,$$

and the regret is $R_n = \max_{a \in \mathcal{A}} R_n(a)$. We emphasise that the only difference
relative to the adversarial linear bandit is that now $y_t$ is observed rather than
$\langle a_t, y_t \rangle$. Actions are not capitalised in this section because the algorithms presented
here do not randomise.

Figure 28.1 Mirror descent is a modern art, as well as science.


Mirror descent 

The basic version of mirror descent has two extra parameters beyond $n$ and $\mathcal{A}$: a
learning rate $\eta > 0$ and a convex function $F : \mathbb{R}^d \to \mathbb{R}$ with domain $\mathcal{D} = \operatorname{dom}(F)$.
Usually $F$ will be Legendre. The function $F$ is called a potential function or
regulariser. In the first round, mirror descent predicts

$$a_1 = \operatorname{argmin}_{a \in \mathcal{A}} F(a) . \tag{28.1}$$

Subsequently it predicts

$$a_{t+1} = \operatorname{argmin}_{a \in \mathcal{A}} \left(\eta \langle a, y_t \rangle + D_F(a, a_t)\right) , \tag{28.2}$$

where $D_F(a, a_t)$ is the $F$-induced Bregman divergence between $a$ and $a_t$. Implicit
in the definition is that $a_1, a_2, \ldots$ are well-defined. The reader is invited to
construct examples when this is not the case (Exercise 28.2). A simple case when
$(a_t)_{t=1}^n$ are well-defined is when $\mathcal{A}$ is compact and $F$ is Legendre.


Follow-the-Regularised-Leader 

Like mirror descent, follow-the-regularised-leader depends on a convex potential
$F$ with domain $\mathcal{D} = \operatorname{dom}(F)$ and predicts $a_1 = \operatorname{argmin}_{a \in \mathcal{A}} F(a)$. In subsequent
rounds $t \in [n]$, the predictions are

$$a_{t+1} = \operatorname{argmin}_{a \in \mathcal{A}} \left(\eta \sum_{s=1}^t \langle a, y_s \rangle + F(a)\right) . \tag{28.3}$$

The intuition is that the algorithm chooses $a_{t+1}$ to be the action that performed
best in hindsight with respect to the regularised loss. Again, the definition of
follow-the-regularised-leader implicitly assumes that $(a_t)_{t=1}^n$ are well-defined. As
for mirror descent, the regularisation serves to stabilise the algorithm, which
turns out to be a key property of good algorithms for online linear prediction.


Follow-the-leader chooses the action that appears best in hindsight,
$a_{t+1} = \operatorname{argmin}_{a \in \mathcal{A}} \sum_{s=1}^t \langle a, y_s \rangle$. In general this algorithm is not well suited
for online linear optimisation because the absence of regularisation makes it
unstable (Exercise 28.4).


Equivalence of Mirror Descent and Follow-the-Regularised-Leader 
At first sight these algorithms do not look that similar. To clarify matters, let us
suppose that $F$ is Legendre with domain $\mathcal{D} \subseteq \mathcal{A}$. In this setting, mirror descent
and follow-the-regularised-leader are identical. To see this, let

$$\Phi_t(a) = \eta \langle a, y_t \rangle + D_F(a, a_t) = \eta \langle a, y_t \rangle + F(a) - F(a_t) - \langle a - a_t, \nabla F(a_t) \rangle .$$

Now mirror descent chooses $a_{t+1}$ to minimise $\Phi_t$. The reader should check that
the assumption that $F$ is Legendre on domain $\mathcal{D} \subseteq \mathcal{A}$ implies that the minimiser
occurs in the interior of $\mathcal{D} \subseteq \mathcal{A}$ and that $\nabla \Phi_t(a_{t+1}) = 0$ (see Exercise 28.1). This
means that $\eta y_t = \nabla F(a_t) - \nabla F(a_{t+1})$, and so

$$\nabla F(a_{t+1}) = -\eta y_t + \nabla F(a_t) = \nabla F(a_1) - \eta \sum_{s=1}^t y_s = -\eta \sum_{s=1}^t y_s ,$$

where the last equality is true because $a_1$ is chosen as the minimiser of $F$ in
$\mathcal{A} \cap \mathcal{D} = \mathcal{D}$, and again the fact that $F$ is Legendre ensures this minimum occurs
at an interior point where the gradient vanishes. Follow-the-regularised-leader
chooses $a_{t+1}$ to minimise $\Phi'_t(a) = \eta \sum_{s=1}^t \langle a, y_s \rangle + F(a)$. The same argument
shows that $\nabla \Phi'_t(a_{t+1}) = 0$, which means that

$$\nabla F(a_{t+1}) = -\eta \sum_{s=1}^t y_s .$$

The last two displays and the fact that the gradient for Legendre functions is
invertible show that mirror descent and follow-the-regularised-leader are the
same in this setting.


The equivalence between these algorithms is far from universal. First of
all, it does not generally hold when $F$ is not Legendre or its domain
is larger than $\mathcal{A}$. Second, in many applications of these algorithms, the
learning rate or potential change with time, and in either case the
algorithms will typically produce different action sequences. For example, if
a learning rate $\eta_t$ is used rather than $\eta$ in the definition of $\Phi_t$, then mirror
descent chooses $\nabla F(a_{t+1}) = -\sum_{s=1}^t \eta_s y_s$, while follow-the-regularised-leader
chooses $\nabla F(a_{t+1}) = -\eta_t \sum_{s=1}^t y_s$. We return to this issue in the
notes and exercises.


EXAMPLE 28.1. Let $\mathcal{A} = \mathbb{R}^d$ and $F(a) = \frac{1}{2} \|a\|_2^2$. Then $\nabla F(a) = a$ and
$D_F(a, a_t) = \frac{1}{2} \|a - a_t\|_2^2$. Clearly $F$ is Legendre and $\mathcal{D} = \mathcal{A}$, so mirror descent and
follow-the-regularised-leader are the same. By simple calculus we see that

$$a_{t+1} = \operatorname{argmin}_{a \in \mathbb{R}^d} \; \eta \langle a, y_t \rangle + \frac{1}{2} \|a - a_t\|_2^2 = a_t - \eta y_t ,$$

which may be familiar as online gradient descent with linear losses. For the
extension to nonlinear convex loss functions, see Note 12.


EXAMPLE 28.2. Let $\mathcal{A}$ be a compact convex subset of $\mathbb{R}^d$ and $F(a) = \frac{1}{2} \|a\|_2^2$.
Then mirror descent chooses

$$a_{t+1} = \operatorname{argmin}_{a \in \mathcal{A}} \; \eta \langle a, y_t \rangle + \frac{1}{2} \|a - a_t\|_2^2 = \Pi(a_t - \eta y_t) , \tag{28.4}$$

where $\Pi(a)$ is the Euclidean projection of $a$ on to $\mathcal{A}$. This algorithm is usually
called online projected gradient descent. On the other hand, for
follow-the-regularised-leader we have

$$a_{t+1} = \operatorname{argmin}_{a \in \mathcal{A}} \; \eta \sum_{s=1}^t \langle a, y_s \rangle + \frac{1}{2} \|a\|_2^2 = \Pi\left(-\eta \sum_{s=1}^t y_s\right) ,$$

which may be a different choice than that of mirror descent.
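The difference between the two updates in Example 28.2 already shows up after two rounds. The sketch below takes the action set to be the Euclidean unit ball, which is an assumption made only so that the projection has a closed form; the loss vectors, the learning rate and the function names are likewise illustrative.

```python
import math

def project_ball(x):
    """Euclidean projection on to the unit ball {a : ||a||_2 <= 1}."""
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= 1.0 else [v / norm for v in x]

def mirror_descent(losses, eta):
    """Online projected gradient descent: a_{t+1} = Pi(a_t - eta * y_t)."""
    a = [0.0] * len(losses[0])          # a_1 = argmin ||a||^2 / 2 = 0
    iterates = [a]
    for y in losses:
        a = project_ball([ai - eta * yi for ai, yi in zip(a, y)])
        iterates.append(a)
    return iterates

def ftrl(losses, eta):
    """Follow-the-regularised-leader: a_{t+1} = Pi(-eta * sum_{s<=t} y_s)."""
    total = [0.0] * len(losses[0])
    iterates = [list(total)]
    for y in losses:
        total = [si + yi for si, yi in zip(total, y)]
        iterates.append(project_ball([-eta * si for si in total]))
    return iterates

losses = [(-2.0, 0.0), (1.0, 0.0)]
md = mirror_descent(losses, eta=1.0)    # iterates a_1, a_2, a_3
fl = ftrl(losses, eta=1.0)
```

Both algorithms agree on $a_2$, which lies on the boundary because a projection was needed. Mirror descent then steps back from the projected point, while follow-the-regularised-leader projects the accumulated sum, so the third iterates differ.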


EXAMPLE 28.3. The exponential weights algorithm that appeared in various
forms on numerous occasions in earlier chapters is a special case of mirror descent
corresponding to choosing the constraint set $\mathcal{A}$ as the simplex in $\mathbb{R}^d$ and choosing
$F$ to be the unnormalised negentropy function of Example 26.5. In this case
follow-the-regularised-leader chooses

$$a_{t+1} = \operatorname{argmin}_{a \in \mathcal{A}} \; \eta \sum_{s=1}^t \langle a, y_s \rangle + \sum_{i=1}^d \left(a_i \log(a_i) - a_i\right) .$$

You will show in Exercise 28.8 that

$$a_{t+1,i} = \frac{\exp\left(-\eta \sum_{s=1}^t y_{si}\right)}{\sum_{j=1}^d \exp\left(-\eta \sum_{s=1}^t y_{sj}\right)} . \tag{28.5}$$
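Eq. (28.5) translates directly into code. The sketch below is a hedged illustration: the function name `exp_weights` and the sample losses are made up, and the subtraction of the smallest cumulative loss is a standard numerical-stability device that is not part of the text (it cancels in the ratio).

```python
import math

def exp_weights(losses, eta):
    """Return a_{t+1} from Eq. (28.5) given the loss vectors y_1, ..., y_t."""
    d = len(losses[0])
    cum = [sum(y[i] for y in losses) for i in range(d)]
    low = min(cum)                      # shifting by a constant leaves the ratio unchanged
    w = [math.exp(-eta * (c - low)) for c in cum]
    z = sum(w)
    return [wi / z for wi in w]

# Two rounds of losses in dimension three.
a = exp_weights([(1.0, 0.0, 0.5), (0.0, 0.0, 0.5)], eta=math.log(2))
```

Here the cumulative losses are $(1, 0, 1)$, so with $\eta = \log 2$ the weights are proportional to $(1/2, 1, 1/2)$.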


A Two-Step Process for Implementation 


Solving the optimisation problem in Eq. (28.2) is often made easier by using
Theorem 26.15 from Chapter 26. Assume $F$ is Legendre and $\mathcal{A}$ is compact and
non-empty, and suppose that

$$\nabla F(a) - \eta y \in \operatorname{int}(\operatorname{dom}(F^*)) \quad \text{for all } a \in \mathcal{A} \cap \mathcal{D} \text{ and } y \in \mathcal{L} . \tag{28.6}$$

Then the solution to Eq. (28.2) can be found using the following two-step
procedure:

$$\tilde a_{t+1} = \operatorname{argmin}_{a \in \mathcal{D}} \; \eta \langle a, y_t \rangle + D_F(a, a_t) \quad \text{and} \tag{28.7}$$
$$a_{t+1} = \operatorname{argmin}_{a \in \mathcal{A}} \; D_F(a, \tilde a_{t+1}) . \tag{28.8}$$

Eq. (28.6) means the first optimisation problem can be evaluated explicitly as
the solution to

$$\eta y_t + \nabla F(\tilde a_{t+1}) - \nabla F(a_t) = 0 . \tag{28.9}$$

Since $F$ is Legendre, Theorem 26.6 shows that $\nabla F$ is a bijection between
$\operatorname{int}(\operatorname{dom}(F))$ and $\operatorname{int}(\operatorname{dom}(F^*))$, which means that $\tilde a_{t+1} = (\nabla F)^{-1}(\nabla F(a_t) - \eta y_t)$.
The optimisation problem in Eq. (28.8) is usually harder to calculate analytically,
but there are important exceptions, as we shall see.
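One such exception is the simplex with the unnormalised negentropy potential, where both steps have closed forms. The sketch below is an illustration under that assumption: $\nabla F(a) = \log(a)$ coordinate-wise, so Eq. (28.9) gives $\tilde a_{t+1} = a_t \exp(-\eta y_t)$, and the Bregman projection on to the simplex is a rescaling by the 1-norm (Exercise 26.11(b)). The function name, losses and learning rate are made up for the example.

```python
import math

def md_two_step(a, y, eta):
    """One mirror descent step on the simplex with the negentropy potential."""
    # Step 1 (Eq. 28.7): grad F(a) = log(a), so the unconstrained minimiser is
    # (grad F)^{-1}(log(a) - eta * y) = a * exp(-eta * y), coordinate-wise.
    a_tilde = [ai * math.exp(-eta * yi) for ai, yi in zip(a, y)]
    # Step 2 (Eq. 28.8): the Bregman projection on to the simplex rescales
    # a_tilde by its 1-norm.
    z = sum(a_tilde)
    return [v / z for v in a_tilde]

eta = 0.5
losses = [(1.0, 0.0, 0.3), (0.2, 0.9, 0.3), (0.0, 1.0, 0.6)]
a = [1.0 / 3] * 3                       # a_1 minimises F over the simplex
for y in losses:
    a = md_two_step(a, y, eta)

# Closed form of Eq. (28.5), computed directly for comparison.
cum = [sum(y[i] for y in losses) for i in range(3)]
w = [math.exp(-eta * c) for c in cum]
closed_form = [wi / sum(w) for wi in w]
```

After three rounds the iterate `a` agrees with `closed_form` up to rounding, which is the exponential weights distribution of Example 28.3.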


All potentials and losses that appear in positive results in this book guarantee
that mirror descent (and also follow-the-regularised-leader) are well defined,
and that the condition in Eq. (28.6) holds.


The two-step implementation of mirror descent also explains its name. The
update in round $t$ can be seen as transforming the action $a_t \in \mathcal{A}$ into the ‘mirror’
(dual) space using $\nabla F$, where it is combined with the most recent (scaled) loss
$\eta y_t$. Then $(\nabla F)^{-1}$ is used to transform the updated vector back to the original
(primal) space. The function $\nabla F$ is called the mirror map.


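The same idea works for follow-the-regularised-leader. Assuming $F$ is
Legendre, $\mathcal{A}$ is compact and non-empty and $-\eta \sum_{s=1}^t y_s \in \operatorname{int}(\operatorname{dom}(F^*))$,
then for follow-the-regularised-leader

$$a_{t+1} = \Pi_{\mathcal{A},F}\left((\nabla F)^{-1}\left(-\eta \sum_{s=1}^t y_s\right)\right) ,$$

where $\Pi_{\mathcal{A},F}$ is the projection on to $\mathcal{A}$ with respect to $D_F$ as described in
Section 26.6.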


Some of the differences between follow-the-regularised-leader and mirror descent 
are illustrated in Fig. 28.2, which shows how the algorithms differ once projections 
start to occur. 


Regret Analysis 


Although mirror descent and follow-the-regularised-leader are not the same, the
bounds presented here are identical. The theorem for mirror descent has two
parts, the first of which is a little stronger than the second. To minimise clutter,
we abbreviate $D_F$ by $D$.


THEOREM 28.4 (Mirror descent regret bound). Let $\eta > 0$ and $F$ be Legendre with
domain $\mathcal{D}$ and $\mathcal{A} \subset \mathbb{R}^d$ be a non-empty convex set with $\operatorname{int}(\operatorname{dom}(F)) \cap \mathcal{A} \neq \emptyset$. Let
$a_1, \ldots, a_{n+1}$ be the actions chosen by mirror descent, which are assumed to be
well-defined. Then, for any $a \in \mathcal{A}$, the regret of mirror descent is bounded by

$$R_n(a) \leq \frac{F(a) - F(a_1)}{\eta} + \sum_{t=1}^n \langle a_t - a_{t+1}, y_t \rangle - \frac{1}{\eta} \sum_{t=1}^n D(a_{t+1}, a_t) .$$

Furthermore, suppose that Eq. (28.6) holds and $\tilde a_2, \tilde a_3, \ldots, \tilde a_{n+1}$ are given by
Eq. (28.7). Then,

$$R_n(a) \leq \frac{1}{\eta} \left(F(a) - F(a_1) + \sum_{t=1}^n D(a_t, \tilde a_{t+1})\right) .$$


Figure 28.2 Illustration of follow-the-regularised-leader and mirror descent. The
constraint set is $\mathcal{A}$, and the function $\Pi_{\mathcal{A},F}$ is the projection on to $\mathcal{A}$ with respect
to the Bregman divergence induced by Legendre function $F$. Follow-the-regularised-leader
accumulates the scaled losses in the dual space, mapping back to the primal
using the inverse map $(\nabla F)^{-1}$. Mirror descent computes the next iterate by
$a_{t+1} = \Pi_{\mathcal{A},F}((\nabla F)^{-1}(\nabla F(a_t) - \eta y_t))$. The algorithms generally behave differently in
the presence of projections. In the figure, the algorithms behave the same until the
fourth iterate, after which the projection that appeared in the computation of the third
iterate breaks the equivalence.


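Proof Fix $a \in \mathcal{A}$. The result trivially holds when $a \notin \mathcal{D}$. Hence, we assume that
$a \in \mathcal{D}$. For the first part of the claim, we split the inner product:

$$\langle a_t - a, y_t \rangle = \langle a_t - a_{t+1}, y_t \rangle + \langle a_{t+1} - a, y_t \rangle .$$

In Exercise 28.1, you will show that $a_t \in \operatorname{int}(\operatorname{dom}(F))$, and hence the Bregman
divergence $D(b, a_t) = F(b) - F(a_t) - \langle b - a_t, \nabla F(a_t) \rangle$ for any $b \in \operatorname{dom}(F)$. By
definition, $a_{t+1} = \operatorname{argmin}_{b \in \mathcal{A}} \eta \langle b, y_t \rangle + D(b, a_t)$. Hence, the first-order optimality
conditions for $a_{t+1}$ (Proposition 26.14) show that

$$\langle a - a_{t+1}, \eta y_t + \nabla F(a_{t+1}) - \nabla F(a_t) \rangle \geq 0 .$$

Reordering and using the definition of the Bregman divergence,

$$\langle a_{t+1} - a, y_t \rangle \leq \frac{1}{\eta} \langle a - a_{t+1}, \nabla F(a_{t+1}) - \nabla F(a_t) \rangle = \frac{1}{\eta} \left(D(a, a_t) - D(a, a_{t+1}) - D(a_{t+1}, a_t)\right) .$$

Using this, along with the definition of the regret,

$$\begin{aligned}
R_n(a) &= \sum_{t=1}^n \langle a_t - a, y_t \rangle \\
&\leq \sum_{t=1}^n \langle a_t - a_{t+1}, y_t \rangle + \frac{1}{\eta} \sum_{t=1}^n \left(D(a, a_t) - D(a, a_{t+1}) - D(a_{t+1}, a_t)\right) \\
&= \sum_{t=1}^n \langle a_t - a_{t+1}, y_t \rangle + \frac{1}{\eta} \left(D(a, a_1) - D(a, a_{n+1}) - \sum_{t=1}^n D(a_{t+1}, a_t)\right) \\
&\leq \sum_{t=1}^n \langle a_t - a_{t+1}, y_t \rangle + \frac{F(a) - F(a_1)}{\eta} - \frac{1}{\eta} \sum_{t=1}^n D(a_{t+1}, a_t) , \qquad (28.10)
\end{aligned}$$

where the final inequality follows from the fact that $D(a, a_{n+1}) \geq 0$ and
$D(a, a_1) \leq F(a) - F(a_1)$, the latter of which is true by the first-order optimality
conditions for $a_1 = \operatorname{argmin}_{b \in \mathcal{A}} F(b)$. To see the second part, note that

$$\begin{aligned}
\langle a_t - a_{t+1}, y_t \rangle &= \frac{1}{\eta} \langle a_t - a_{t+1}, \nabla F(a_t) - \nabla F(\tilde a_{t+1}) \rangle \\
&= \frac{1}{\eta} \left(D(a_{t+1}, a_t) + D(a_t, \tilde a_{t+1}) - D(a_{t+1}, \tilde a_{t+1})\right) \\
&\leq \frac{1}{\eta} \left(D(a_{t+1}, a_t) + D(a_t, \tilde a_{t+1})\right) .
\end{aligned}$$

The result follows by substituting this into Eq. (28.10).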


The assumption that $a_1$ minimises the potential was only used to bound
$D(a, a_1) \le F(a) - F(a_1)$. For a different initialisation, the following bound
still holds:

$$ R_n(a) \le \frac{1}{\eta} \left( D(a, a_1) + \sum_{t=1}^n D(a_t, \tilde a_{t+1}) \right) . \tag{28.11} $$

As we shall see in Chapter 31, this is useful when using mirror descent to
analyse non-stationary bandits.


The first part of Theorem 28.4 also holds for follow-the-regularised-leader as 
stated in the next result, the proof of which is left for Exercise 28.5. 


THEOREM 28.5 (Follow-the-regularised-leader regret bound). Let $\eta > 0$, $F$
be convex with domain $\mathcal{D}$, $\mathcal{A} \subseteq \mathbb{R}^d$ be a non-empty convex set. Assume that
$a_1, \dots, a_{n+1}$ chosen by follow-the-regularised-leader are well defined. Then, for
any $a \in \mathcal{A}$, the regret of follow-the-regularised-leader is bounded by

$$ R_n(a) \le \frac{F(a) - F(a_1)}{\eta} + \sum_{t=1}^n \langle a_t - a_{t+1}, y_t \rangle - \frac{1}{\eta} \sum_{t=1}^n D(a_{t+1}, a_t) . $$


We now give two applications of the regret bound of Theorem 28.4 for mirror
descent. The same results hold for follow-the-regularised-leader on these problems,
except that the proof would use Theorem 28.5 in place of Theorem 28.4. Let
$\operatorname{diam}_F(\mathcal{A}) = \max_{a, b \in \mathcal{A}} F(a) - F(b)$ be the diameter of $\mathcal{A}$ with respect to $F$.


PROPOSITION 28.6 (Regret on the unit ball). Let $\mathcal{A} = B_2^d = \{a \in \mathbb{R}^d : \|a\|_2 \le 1\}$
be the standard unit ball and assume $y_t \in B_2^d$ for all $t$. Then mirror descent with
potential $F(a) = \frac{1}{2} \|a\|_2^2$ and $\eta = \sqrt{1/n}$ is well defined and its regret satisfies
$R_n \le \sqrt{n}$.


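Proof That mirror descent is well defined follows by a direct calculation (cf.
Example 28.1). By Eq. (28.9), we have $\tilde a_{t+1} = a_t - \eta y_t$, so

$$ D(a_t, \tilde a_{t+1}) = \frac{1}{2} \|\tilde a_{t+1} - a_t\|_2^2 = \frac{\eta^2}{2} \|y_t\|_2^2 . $$

Therefore, since $\operatorname{diam}_F(\mathcal{A}) = 1/2$ and $\|y_t\|_2 \le 1$ for all $t$,

$$ R_n \le \frac{\operatorname{diam}_F(\mathcal{A})}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \|y_t\|_2^2 \le \frac{1}{2\eta} + \frac{\eta n}{2} = \sqrt{n} . $$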
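The bound of Proposition 28.6 holds for every loss sequence in the unit ball, which makes it easy to check numerically. The following is a minimal sketch, assuming randomly drawn unit-norm losses; the helper names `project_to_ball`, `ball_regret` and `random_unit_vector` are illustrative and not from the text.

```python
import math
import random


def project_to_ball(x):
    """Euclidean projection onto the unit ball."""
    norm = math.sqrt(sum(v * v for v in x))
    return x if norm <= 1.0 else [v / norm for v in x]


def ball_regret(losses):
    """Regret of mirror descent with F(a) = ||a||^2 / 2 and eta = sqrt(1/n)."""
    n, d = len(losses), len(losses[0])
    eta = math.sqrt(1.0 / n)
    a = [0.0] * d        # a_1 = argmin F
    incurred = 0.0
    total = [0.0] * d    # running sum of the loss vectors
    for y in losses:
        incurred += sum(ai * yi for ai, yi in zip(a, y))
        total = [s + yi for s, yi in zip(total, y)]
        # unconstrained step a_t - eta * y_t, then projection onto the ball
        a = project_to_ball([ai - eta * yi for ai, yi in zip(a, y)])
    # the best fixed action is -total / ||total||, whose loss is -||total||
    best = -math.sqrt(sum(s * s for s in total))
    return incurred - best


def random_unit_vector(rng, d):
    """Unit vector with a uniformly random direction (assumed test data)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


rng = random.Random(0)
n, d = 400, 5
losses = [random_unit_vector(rng, d) for _ in range(n)]
regret = ball_regret(losses)
```

Whatever sequence is supplied, the returned regret should not exceed $\sqrt{n}$ as long as every loss vector has Euclidean norm at most one.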


PROPOSITION 28.7 (Regret on the simplex). Let $\mathcal{A} = \mathcal{P}_{d-1}$ be the probability
simplex and $y_t \in \mathcal{L} = [0,1]^d$ for all $t$. Then mirror descent with the unnormalised
negentropy potential and $\eta = \sqrt{2 \log(d) / n}$ is well defined and its regret satisfies
$R_n \le \sqrt{2 n \log(d)}$.


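Proof That mirror descent is well-defined follows because the simplex is compact.
The Bregman divergence with respect to the unnormalised negentropy potential
for $a, b \in \mathcal{A}$ is $D(a, b) = \sum_{i=1}^d a_i \log(a_i / b_i)$. Therefore,

$$ \begin{aligned}
R_n(a) &\le \frac{F(a) - F(a_1)}{\eta} + \sum_{t=1}^n \langle a_t - a_{t+1}, y_t \rangle - \frac{1}{\eta} \sum_{t=1}^n D(a_{t+1}, a_t) \\
&\le \frac{\log(d)}{\eta} + \sum_{t=1}^n \|a_t - a_{t+1}\|_1 \|y_t\|_\infty - \frac{1}{2\eta} \sum_{t=1}^n \|a_t - a_{t+1}\|_1^2 \\
&\le \frac{\log(d)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \|y_t\|_\infty^2
\le \frac{\log(d)}{\eta} + \frac{\eta n}{2} = \sqrt{2 n \log(d)} ,
\end{aligned} $$

where the first inequality follows from Theorem 28.4, the second from Hölder's
inequality, Pinsker's inequality and the fact that $\operatorname{diam}_F(\mathcal{A}) = \log(d)$. In the third
inequality, we used 'optimise to bound'. In particular, we used that for any $a \in \mathbb{R}$ and $b > 0$,
$\max_{x \in \mathbb{R}} ax - bx^2/2 = a^2/(2b)$. The last inequality follows from the assumption
that $\|y_t\|_\infty \le 1$.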
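On the simplex with the unnormalised negentropy, the mirror descent iterate has the familiar exponential-weights form, with $a_{t,i}$ proportional to $\exp(-\eta \sum_{s<t} y_{s,i})$. The following is a minimal sketch under that assumption, with uniformly random losses as stand-in data; the name `exp_weights_regret` is illustrative.

```python
import math
import random


def exp_weights_regret(losses):
    """Regret of mirror descent with unnormalised negentropy on the simplex."""
    n, d = len(losses), len(losses[0])
    eta = math.sqrt(2.0 * math.log(d) / n)
    cum = [0.0] * d      # cumulative loss of each coordinate
    incurred = 0.0
    for y in losses:
        # closed form of the iterate: a_i proportional to exp(-eta * cum_i);
        # subtracting min(cum) only improves numerical stability
        m = min(cum)
        w = [math.exp(-eta * (c - m)) for c in cum]
        z = sum(w)
        a = [wi / z for wi in w]
        incurred += sum(ai * yi for ai, yi in zip(a, y))
        cum = [c + yi for c, yi in zip(cum, y)]
    # the best fixed action is a vertex of the simplex
    return incurred - min(cum)


rng = random.Random(0)
n, d = 500, 10
losses = [[rng.random() for _ in range(d)] for _ in range(n)]
regret = exp_weights_regret(losses)
```

For any losses in $[0,1]^d$ the returned value should stay below $\sqrt{2n\log(d)}$, in line with Proposition 28.7.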
The last few steps in the above proof are so routine that we summarise their
use in a corollary, the proof of which we leave to the reader (Exercise 28.6).


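COROLLARY 28.8. Let $F$ be a Legendre potential and $\|\cdot\|_t$ be a norm on $\mathbb{R}^d$ for
each $t \in [n]$ such that $D_F(a_{t+1}, a_t) \ge \frac{1}{2} \|a_{t+1} - a_t\|_t^2$. Then the regret of mirror
descent or follow-the-regularised-leader satisfies

$$ R_n \le \frac{\operatorname{diam}_F(\mathcal{A})}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \|y_t\|_{t*}^2 , $$

where $\|y\|_{t*} = \max_{x : \|x\|_t \le 1} \langle x, y \rangle$ is the dual norm of $\|\cdot\|_t$.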


It often happens that the easiest way to bound the regret of mirror descent is to 
find a norm that satisfies the conditions of Corollary 28.8. Often, Theorem 26.13 
provides a good approach. 


EXAMPLE 28.9. To illustrate a suboptimal application of mirror descent,
suppose we had chosen $F(a) = \frac{1}{2} \|a\|_2^2$ in the setting of Proposition 28.7. Then
$D_F(a_{t+1}, a_t) = \frac{1}{2} \|a_{t+1} - a_t\|_2^2$ suggests choosing $\|\cdot\|_t$ to be the standard Euclidean
norm. Since $\operatorname{diam}_F(\mathcal{A}) \le 1/2$ and $\|\cdot\|_{2*} = \|\cdot\|_2$, applying Corollary 28.8 shows
that

$$ R_n \le \frac{1}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|y_t\|_2^2 . $$

But now we see that $\|y_t\|_2^2$ can be as large as $d$, and tuning $\eta$ would lead to a
rate of $O(\sqrt{nd})$ rather than $O(\sqrt{n \log(d)})$.
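The tuning step in this example can be written out explicitly. The following worked calculation assumes only the Corollary 28.8 bound with the Euclidean norm and the worst-case estimate $\|y_t\|_2^2 \le d$ for $y_t \in [0,1]^d$:

$$ R_n \le \frac{1}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|y_t\|_2^2 \le \frac{1}{2\eta} + \frac{\eta n d}{2} , \qquad \eta = \frac{1}{\sqrt{nd}} \;\Longrightarrow\; R_n \le \frac{\sqrt{nd}}{2} + \frac{\sqrt{nd}}{2} = \sqrt{nd} . $$

The choice of $\eta$ balances the two terms, and the resulting $\sqrt{nd}$ should be compared with the $\sqrt{2n\log(d)}$ obtained from the negentropy potential.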


Both Theorems 28.4 and 28.5 were presented for the oblivious case where
$(y_t)_{t=1}^n$ are chosen in advance. This assumption was not used, however, and in
fact the bounds continue to hold when $y_t$ is chosen strategically as a function
of $a_1, y_1, \dots, y_{t-1}, a_t$. This is analogous to how the basic regret bound for
exponential weights continues to hold in the face of strategic losses. But
be cautioned, this result does not carry over immediately to the application of
mirror descent to bandits, as discussed at the end of the chapter in Note 9.


Application to Linear Bandits 


We now show how mirror descent and follow-the-regularised-leader can be used to
construct algorithms for adversarial linear bandit problems. As in the previous
chapter, the adversary chooses a sequence of vectors $y_1, \dots, y_n$ with $y_t \in \mathcal{L} \subset \mathbb{R}^d$.
In each round the learner chooses $A_t \in \mathcal{A} \subset \mathbb{R}^d$ and observes $\langle A_t, y_t \rangle$. The regret
relative to action $a \in \mathcal{A}$ is

$$ R_n(a) = \mathbb{E}\left[ \sum_{t=1}^n \langle A_t - a, y_t \rangle \right] . $$

The regret is $R_n = \max_{a \in \mathcal{A}} R_n(a)$. The application of mirror descent and
follow-the-regularised-leader to linear bandits is straightforward. The only difficulty
is that the learner does not observe $y_t$ but instead $\langle A_t, y_t \rangle$. The solution is to
replace $y_t$ with an estimator, which is typically some kind of importance-weighted
estimator as in the previous chapter. Because estimation of $y_t$ is only possible
using randomisation, the algorithm cannot play the suggested action of mirror
descent, but instead plays a distribution over actions with the same mean as the
proposed action. This is often necessary anyway, when $\mathcal{A}$ is not convex. Since
the losses are linear, the expected additional regret by playing according to the
distribution vanishes. The algorithm is summarised in Algorithm 16. We have
switched to capital letters because the actions are now randomised.


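THEOREM 28.10 (Regret of Mirror-Descent and FTRL with bandit feedback).
Suppose that Algorithm 16 is run with Legendre potential $F$, convex action set
$\mathcal{A} \subset \mathbb{R}^d$ and learning rate $\eta > 0$ such that the loss estimators are unbiased:
$\mathbb{E}[\hat Y_t \mid \bar A_t] = y_t$ for all $t \in [n]$. Then the regret for either variant of Algorithm 16,
provided that they are well defined, is bounded by

$$ R_n(a) \le \mathbb{E}\left[ \frac{F(a) - F(\bar A_1)}{\eta} + \sum_{t=1}^n \langle \bar A_t - \bar A_{t+1}, \hat Y_t \rangle - \frac{1}{\eta} \sum_{t=1}^n D(\bar A_{t+1}, \bar A_t) \right] . $$

Furthermore, letting

$$ \tilde A_{t+1} = \operatorname{argmin}_{a \in \operatorname{dom}(F)} \eta \langle a, \hat Y_t \rangle + D_F(a, \bar A_t) $$

and assuming that $-\eta \hat Y_t + \nabla F(a) \in \nabla F(\operatorname{dom}(F))$ for all $a \in \mathcal{A}$ almost surely,
the regret of the mirror descent variation satisfies

$$ R_n(a) \le \frac{\operatorname{diam}_F(\mathcal{A})}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \mathbb{E}\left[ D(\bar A_t, \tilde A_{t+1}) \right] . $$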


Proof Using the definition of the algorithm and the assumption that $\hat Y_t$ is
unbiased given $\bar A_t$ and that $P_t$ has mean $\bar A_t$ leads to

$$ \mathbb{E}[\langle A_t, y_t \rangle] = \mathbb{E}[\langle \bar A_t, y_t \rangle] = \mathbb{E}\left[ \mathbb{E}[\langle \bar A_t, \hat Y_t \rangle \mid \bar A_t] \right] = \mathbb{E}[\langle \bar A_t, \hat Y_t \rangle] , $$

where the last equality used the linearity of expectations. Hence,

$$ R_n(a) = \mathbb{E}\left[ \sum_{t=1}^n \langle A_t, y_t \rangle - \langle a, y_t \rangle \right] = \mathbb{E}\left[ \sum_{t=1}^n \langle \bar A_t - a, \hat Y_t \rangle \right] , $$

which is the expected random regret of mirror descent or follow-the-regularised-
leader on the recursively constructed sequence $\hat Y_t$. The result follows from
Theorem 28.4 or Theorem 28.5 and the note at the end of the last section
that says these theorems continue to hold even for recursively constructed loss
sequences. $\square$


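1: Input Legendre potential $F$, action set $\mathcal{A}$ and learning rate $\eta > 0$
2: Choose $\bar A_1 = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} F(a)$
3: for $t = 1, \dots, n$ do
4:     Choose measure $P_t$ on $\mathcal{A}$ with mean $\bar A_t$
5:     Sample action $A_t$ from $P_t$ and observe $\langle A_t, y_t \rangle$
6:     Compute estimate $\hat Y_t$ of the loss vector $y_t$
7:     Update:
           $\bar A_{t+1} = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} \eta \langle a, \hat Y_t \rangle + D_F(a, \bar A_t)$    (Mirror descent)
           $\bar A_{t+1} = \operatorname{argmin}_{a \in \mathcal{A} \cap \operatorname{dom}(F)} \eta \sum_{s=1}^t \langle a, \hat Y_s \rangle + F(a)$    (follow-the-regularised-leader)
8: end for

Algorithm 16: Online stochastic mirror descent/follow-the-regularised-leader.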


Linear Bandits on the Unit Ball 


To illustrate the power of these methods, we return to adversarial linear bandits
and the special case where the action set is the unit ball. In the previous chapter,
we showed that continuous exponential weights on the unit ball with
Kiefer–Wolfowitz exploration has a regret of

$$ R_n = O(d \sqrt{n \log(n)}) . $$

Surprisingly, follow-the-regularised-leader with a carefully chosen potential
improves on this bound by a factor of $\sqrt{d}$.

For the remainder of this section, let $\|\cdot\| = \|\cdot\|_2$ be the standard Euclidean
norm and $\mathcal{A} = B_2^d$ be the standard unit ball. In order to instantiate
follow-the-regularised-leader we need a potential, a sampling rule, an unbiased estimator and
a learning rate. Note that the only source of randomness is the randomisation in
the algorithm. Hence, let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, \dots, A_{t-1}]$. We start with the sampling
rule and estimator. Recall that in round $t$ we need to choose a distribution on
$\mathcal{A}$ with mean $\bar A_t$ and sufficient variability that the variance of the estimator is
not too large. Given the past, let $E_t$ and $U_t$ be independent, where $E_t \in \{0, 1\}$ is
such that $\mathbb{E}_t[E_t] = 1 - \|\bar A_t\|$ and $U_t$ is uniformly distributed on $\{\pm e_1, \dots, \pm e_d\}$.
The algorithm chooses

$$ A_t = E_t U_t + (1 - E_t) \frac{\bar A_t}{\|\bar A_t\|} . $$

In other words, $E_t = 1$ indicates that the algorithm explores, which happens with
probability $1 - \|\bar A_t\|$. Clearly, $\mathbb{E}_t[A_t] = \bar A_t$. (The sampling distribution $P_t$ is just
the law of $A_t$ given the past, which remains implicit.) For the estimator we use a
variant of the importance-weighted estimator from the last chapter:


$$ \hat Y_t = \frac{d \, E_t A_t \langle A_t, y_t \rangle}{1 - \|\bar A_t\|} . \tag{28.12} $$


The reader can check for themself that this estimator is unbiased. Next, we 
inspect the contents of our magician’s hat and select the potential 


$$ F(a) = -\log(1 - \|a\|) - \|a\| . $$


There is one more modification. Rather than instantiating follow-the-regularised-
leader with action set $\mathcal{A}$, we use $\tilde{\mathcal{A}} = \{x \in \mathbb{R}^d : \|x\|_2 \le r\}$, where $r < 1$ is a
radius to be tuned subsequently. The reason for this modification is to control
the variance of the estimator in Eq. (28.12), which blows up as $\bar A_t$ gets close to
the boundary. You will show in Exercise 28.7 that

$$ \bar A_t = \Pi\left( \frac{-\eta \hat L_{t-1}}{1 + \eta \|\hat L_{t-1}\|} \right) \quad \text{with} \quad \hat L_{t-1} = \sum_{s=1}^{t-1} \hat Y_s , \tag{28.13} $$

where $\Pi(x)$ is the projection operator on to $\tilde{\mathcal{A}}$ with respect to $\|\cdot\|_2$.


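1: Input Learning rate $\eta > 0$
2: for $t = 1, \dots, n$ do
3:     Compute $\bar A_t = \Pi\left( \frac{-\eta \hat L_{t-1}}{1 + \eta \|\hat L_{t-1}\|} \right)$ with $\hat L_{t-1} = \sum_{s=1}^{t-1} \hat Y_s$
4:     Sample $E_t \in \{0, 1\}$ from a Bernoulli distribution with bias $1 - \|\bar A_t\|$ and $U_t$ uniformly
       on $\{\pm e_1, \dots, \pm e_d\}$
5:     Play action $A_t = E_t U_t + (1 - E_t) \frac{\bar A_t}{\|\bar A_t\|}$
6:     Observe $\langle A_t, y_t \rangle$ and estimate loss vector $\hat Y_t = \frac{d \, E_t A_t \langle A_t, y_t \rangle}{1 - \|\bar A_t\|}$
7: end for

Algorithm 17: Follow-the-regularised-leader for linear bandits on the unit ball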
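The algorithm above is short enough to simulate directly. The following is a minimal sketch, assuming a fixed (oblivious) loss sequence, the closed form of Eq. (28.13) for the mean action and the estimator of Eq. (28.12); the function names `norm`, `mean_action` and `run_ball_bandit`, as well as the test losses, are illustrative choices rather than part of the text.

```python
import math
import random


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def mean_action(cum_est, eta, r):
    """Eq. (28.13): unconstrained iterate, projected onto the ball of radius r."""
    scale = -eta / (1.0 + eta * norm(cum_est))
    a = [scale * v for v in cum_est]
    na = norm(a)
    return a if na <= r else [r * v / na for v in a]


def run_ball_bandit(losses, eta, r, rng):
    """Play against a fixed loss sequence; return the total loss and the mean actions."""
    d = len(losses[0])
    cum_est = [0.0] * d    # running sum of the loss estimates
    total, means = 0.0, []
    for y in losses:
        a_bar = mean_action(cum_est, eta, r)
        na = norm(a_bar)
        means.append(a_bar)
        explore = rng.random() < 1.0 - na      # E_t = 1 with probability 1 - ||a_bar||
        if explore:
            i, sign = rng.randrange(d), rng.choice((-1.0, 1.0))
            action = [sign if j == i else 0.0 for j in range(d)]
        else:
            action = [v / na for v in a_bar]   # exploit on the boundary of the unit ball
        loss = sum(ai * yi for ai, yi in zip(action, y))   # the only feedback used
        total += loss
        if explore:
            # importance-weighted estimate of Eq. (28.12); it is zero when exploiting
            cum_est = [c + d * ai * loss / (1.0 - na) for c, ai in zip(cum_est, action)]
    return total, means


rng = random.Random(1)
n, d = 2000, 3
eta = math.sqrt(math.log(n) / (3 * d * n))
r = 1.0 - 2.0 * eta * d
losses = [[0.6, -0.3, 0.2]] * n    # a constant loss vector of norm 0.7
total, means = run_ball_bandit(losses, eta, r, rng)
```

Because the mean actions are kept inside the ball of radius $r < 1$, the denominator $1 - \|\bar A_t\|$ in the estimator is at least $1 - r$, which is how the shrunken action set keeps the estimates bounded.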


THEOREM 28.11. Assume that $(y_t)_{t=1}^n$ are a sequence of losses such that $\|y_t\|_2 \le 1$
for all $t$. Suppose that Algorithm 16 is run using the sampling rule, estimator
and potential as described above, shrunken action set $\tilde{\mathcal{A}}$ with $r = 1 - 2\eta d$ where
the learning rate is $\eta = \sqrt{\log(n) / (3dn)}$. Then, the algorithm is well defined and
its regret satisfies $R_n \le 2 \sqrt{3nd \log(n)}$.

You might notice that in some regimes this is smaller than the lower bound
for stochastic linear bandits (Theorem 24.2). There is no contradiction
because the adversarial and stochastic linear bandit models are actually
quite different. More details are in Chapter 29.


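Proof That the algorithm is well defined follows because $\tilde{\mathcal{A}}$ is compact. Let
$a^* = \operatorname{argmin}_{a \in \mathcal{A}} \sum_{t=1}^n \langle a, y_t \rangle$ be the optimal action. Then

$$ R_n = \mathbb{E}\left[ \sum_{t=1}^n \langle A_t - r a^*, y_t \rangle \right] + \sum_{t=1}^n \langle r a^* - a^*, y_t \rangle \le R_n(r a^*) + (1 - r) n , $$

where the inequality follows from the definition of $\mathcal{A}$ and Cauchy–Schwarz. By
Theorems 26.13 and 28.10, provided that the Hessian of $F$ is invertible over the
interior of its domain,

$$ R_n(r a^*) \le \frac{\operatorname{diam}_F(\tilde{\mathcal{A}})}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \mathbb{E}\left[ \|\hat Y_t\|^2_{\nabla^2 F(Z_t)^{-1}} \right] , \tag{28.14} $$

where $Z_t$ lies on the chord connecting $\bar A_t$ and $\bar A_{t+1}$.
The algorithm is stable in the sense that no matter how the losses are chosen,
$\bar A_{t+1}$ cannot be too far from $\bar A_t$. This also means that $Z_t$ is close to $\bar A_t$. By
definition, $\eta \|\hat Y_t\| \le \eta d / (1 - r) = 1/2$. Combining this with Eq. (28.13) shows that

$$ \begin{aligned}
\frac{1 - \|Z_t\|}{1 - \|\bar A_t\|}
&\le \sup_{\alpha \in [0,1]} \frac{1 - \|\alpha \bar A_t + (1 - \alpha) \bar A_{t+1}\|}{1 - \|\bar A_t\|}
= \max\left\{ 1, \frac{1 - \|\bar A_{t+1}\|}{1 - \|\bar A_t\|} \right\} \\
&\le \max\left\{ 1, \frac{1 + \eta \|\hat L_{t-1}\|}{1 + \eta \|\hat L_t\|} \right\}
\le \max\left\{ 1, \frac{1 + \eta \|\hat L_{t-1}\|}{1/2 + \eta \|\hat L_{t-1}\|} \right\} \le 2 .
\end{aligned} $$

Here, the second inequality is proved by noting that if the maximum is not one,
$\|\bar A_{t+1}\| \le \|\bar A_t\|$. The next step is to find the Hessian of $F$, which is

$$ \nabla^2 F(a) = \frac{I}{1 - \|a\|} + \frac{a a^\top}{\|a\| (1 - \|a\|)^2} \succeq \frac{I}{1 - \|a\|} . $$

This verifies that the Hessian is invertible over the interior of the domain of $F$ and
thus justifies Eq. (28.14). Now, we also have $(\nabla^2 F(a))^{-1} \preceq (1 - \|a\|) I$, and so

$$ \mathbb{E}\left[ \|\hat Y_t\|^2_{\nabla^2 F(Z_t)^{-1}} \right] \le \mathbb{E}\left[ (1 - \|Z_t\|) \|\hat Y_t\|^2 \right] = d^2 \, \mathbb{E}\left[ \frac{(1 - \|Z_t\|) E_t \langle U_t, y_t \rangle^2}{(1 - \|\bar A_t\|)^2} \right] \le 2d . $$

The diameter satisfies $\operatorname{diam}_F(\tilde{\mathcal{A}}) \le \log(1/(1 - r))$, and hence

$$ R_n \le (1 - r) n + \frac{1}{\eta} \log\left( \frac{1}{1 - r} \right) + \eta n d = 3 \eta n d + \frac{1}{\eta} \log\left( \frac{1}{2 \eta d} \right) \le 2 \sqrt{3nd \log(n)} , $$

where the last two relations follow from the choices of $r$ and $\eta$, respectively.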
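For readers who want the final tuning step spelled out, the following worked calculation assumes only the three contributions identified in the proof: the shrinkage cost $(1-r)n$, the diameter term $\frac{1}{\eta}\log(1/(1-r))$ and the variance term $\frac{\eta}{2} \cdot n \cdot 2d = \eta n d$. With $r = 1 - 2\eta d$ and $\eta = \sqrt{\log(n)/(3dn)}$, and for $n \ge 2$,

$$ \begin{aligned}
3 \eta n d &= \sqrt{3nd \log(n)} , \\
\frac{1}{2 \eta d} &= \sqrt{\frac{3n}{4 d \log(n)}} \le n
\quad \Longrightarrow \quad
\frac{1}{\eta} \log\left( \frac{1}{2 \eta d} \right) \le \frac{\log(n)}{\eta} = \sqrt{3nd \log(n)} .
\end{aligned} $$

Adding the two lines gives $R_n \le 2\sqrt{3nd\log(n)}$. The inequality $\sqrt{3n/(4d\log(n))} \le n$ is equivalent to $3 \le 4 d n \log(n)$, which holds whenever $n \ge 2$ and $d \ge 1$.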
We could have used mirror descent rather than follow-the-regularised-
leader with a slightly more complicated proof and the same bound except
for constants. Using continuous exponential weights and the analysis in
Section 27.3 would yield a bound that is a factor of $\sqrt{d}$ worse than the above,
and we believe that this cannot be improved.


Notes 


1 Our assumptions on the potential and action set in the analysis of mirror
descent (Theorem 28.4) can be relaxed significantly. What is important is
that $F$ is convex and the directional derivative $v \mapsto \nabla_v F(x)$ is linear for
all values for which it exists. Our assumptions are chosen to ensure that
$a_t \in \operatorname{int}(\operatorname{dom}(F))$, which for Legendre $F$ means that $\nabla F(a_t)$ exists, and hence
$\nabla_v F(a_t) = \langle v, \nabla F(a_t) \rangle$ is linear. A comprehensive examination of various
generalisations is given by Joulani et al. [2017]. For follow-the-regularised-
leader, convexity of $F$ suffices, as you will show using directional derivatives in
Exercise 28.5.


2 Finding $a_t$ for both mirror descent and follow-the-regularised-leader requires
solving a convex optimisation problem. Provided the dimension is not too 
large and the action set and potential are reasonably nice, there exist practical 
approximation algorithms for this problem. The two-step process described in 
Eqs. (28.7) and (28.8) is sometimes an easier way to go. Usually (28.7) can 
be solved analytically, while (28.8) can be quite expensive. In some important 
special cases, however, the projection step can be written in closed form or 
efficiently approximated. 


3 We saw that follow-the-regularised-leader with a carefully chosen potential
function achieves $O(\sqrt{dn \log(n)})$ regret on the $\ell_2$-ball. On the $\ell_\infty$-ball
(hypercube), the optimal regret is $O(d \sqrt{n})$. Interestingly, as $n$ tends to infinity
the optimal dependence on the dimension for $\mathcal{A} = B_p^d = \{x \in \mathbb{R}^d : \|x\|_p \le 1\}$
with $p \ge 1$ is either $d$ or $\sqrt{d}$ with a complete classification given by Bubeck
et al. [2018].


4 Adversarial linear bandits with $\mathcal{A} = \mathcal{P}_{k-1}$ are essentially equivalent to $k$-armed
adversarial bandits. There exists a potential such that the resulting algorithm
satisfies $R_n = O(\sqrt{kn})$, which matches the lower bound up to constant factors
and shaves a factor of $\sqrt{\log k}$ from the upper bounds presented in Chapters 11
and 12. For more details, see Exercise 28.15.

5 Most of the bounds proven for adversarial bandits have a worst-case flavour.
The tools in this chapter can often be applied to prove adaptive bounds. In
Exercise 28.14, you will analyse a simple algorithm for $k$-armed adversarial
bandits for which

$$ R_n = O\left( \sqrt{ k \left( 1 + \min_{a \in [k]} \sum_{t=1}^n y_{ta} \right) \log\left( \frac{n}{k} \right) } \right) . $$

Bounds of this kind are called first-order bounds [Allenberg et al., 2006,
Abernethy et al., 2012, Neu, 2015b, Wei and Luo, 2018]. The $\log(n/k)$ term
can be improved to $\log(k)$ using a more sophisticated algorithm/analysis.

6 Both mirror descent and follow-the-regularised-leader depend on the potential
function. Currently there is no characterisation of exactly what this potential
should be or how to find it. At least in the full information setting, there are
quite general universality results showing that if a certain regret is achievable
by some algorithm, then that same regret is nearly achievable by mirror descent
with some potential [Srebro et al., 2011]. In practice this result is not useful for 
constructing new potential functions, however. There have been some attempts 
to develop ‘universal’ potential functions that exhibit nice behaviour for any 
action sets [Bubeck et al., 2015b, and others]. These can be useful, but as yet 
we do not know precisely what properties are crucial, especially in the bandit 
case. 

7 When the horizon is unknown, the learning rate cannot be tuned ahead of time.
One option is to apply the doubling trick. A more elegant solution is to use a 
decreasing schedule of learning rates. This requires an adaptation of the proofs 
of Theorems 28.4 and 28.5, which we outline in Exercises 28.11 and 28.12. This 
is one situation where mirror descent and follow-the-regularised-leader are not 
the same and where the latter algorithm is usually to be preferred. 

8 In much of the literature the potential is chosen in such a way that mirror descent
and follow-the-regularised-leader are the same algorithm. For historical reasons, 
the name mirror descent is more commonly used in the bandit community. 
Unfortunately ‘mirror descent’ is often used, sometimes with qualifiers, when 
the algorithm being analysed is actually follow-the-regularised-leader. This is 
confusing and makes it hard to identify for which algorithm the results actually 
hold. Naming aside, we encourage the reader to keep both algorithms in mind, 
since the analysis of one or the other can sometimes be slightly easier. 

9 Mirror descent and follow-the-regularised-leader are used as modules for
converting loss sequences to distributions. Since these losses depend on past
actions, it is crucial that both algorithms are well-behaved in the full-information
setting when the losses are chosen non-obliviously. This does not translate to
the bandit setting for a subtle reason. Let $\hat R_n(a) = \sum_{t=1}^n \langle A_t - a, y_t \rangle$ be the
random regret so that

$$ R_n = \mathbb{E}\left[ \max_{a \in \mathcal{A}} \hat R_n(a) \right] = \mathbb{E}\left[ \sum_{t=1}^n \langle A_t, y_t \rangle - \min_{a \in \mathcal{A}} \sum_{t=1}^n \langle a, y_t \rangle \right] . $$

The second sum is constant when the losses are oblivious, which means the
maximum can be brought outside the expectation, which is not true if the loss
vectors are non-oblivious. It is still possible to bound the expected loss relative
to a fixed comparator $a$ so that

$$ R_n(a) = \mathbb{E}\left[ \sum_{t=1}^n \langle A_t - a, y_t \rangle \right] \le B , $$

where $B$ is whatever bound is obtained from the analysis presented above. Using
$\max_a \hat R_n(a) \le \max_a (\hat R_n(a) - R_n(a)) + \max_a R_n(a)$ shows that

$$ R_n = \mathbb{E}\left[ \max_{a} \hat R_n(a) \right] \le B + \mathbb{E}\left[ \max_{a} \left( \hat R_n(a) - R_n(a) \right) \right] . $$

The second term on the right-hand side can be bounded using tools from
empirical process theory, but the resulting bound is $O(\sqrt{n})$ only if $\mathbb{V}[\hat R_n(a)] = O(n)$.
In general, however, the variance can be much larger (for an example, see
Exercise 11.6). We emphasise again that the non-oblivious regret is a strange
measure because it does not capture the reactive nature of the environment.
The details of the application of empirical process theory are beyond the scope
of this book. For an introduction to that topic, we recommend the books by
van der Vaart and Wellner [1996a], van de Geer [2000], Boucheron et al. [2013]
and Dudley [2014].

10 The price of bandit information on the unit ball is an extra $\sqrt{d \log(n)}$ (compare
Proposition 28.6 and Theorem 28.11). Except for log factors, this is also true for
the simplex (Proposition 28.7 and Note 4). One might wonder if the difference
is always about $\sqrt{d}$, but this is not true. The price of bandit information can
be as high as $O(d)$. Overall the dimension dependence in the regret in terms of
the action set is still not well understood except for special cases.

11 The poor behaviour of follow-the-leader in the full information setting depends
on (a) the environment being adversarial rather than stochastic and (b) the 
action set having sharp corners. When either of these factors is missing, follow- 
the-leader is a reasonable choice [Huang et al., 2017b]. Note that with bandit 
feedback, the failure is primarily due to a lack of exploration (Exercises 4.12 
and 4.13). 

12 A generalisation of online linear optimisation is online convex optimisation,
where the adversary secretly chooses a sequence of convex functions $f_1, \dots, f_n$.
In each round the learner chooses $a_t \in \mathcal{A}$ and observes the entire function $f_t$.
As usual, the regret relative to $a \in \mathcal{A}$ is

$$ R_n(a) = \sum_{t=1}^n f_t(a_t) - f_t(a) . $$

One way to tackle this problem is to linearise the loss functions. Let
$y_t = \nabla f_t(a_t)$. Then, by convexity of the loss functions,

$$ R_n(a) \le \sum_{t=1}^n \langle a_t - a, y_t \rangle , $$

which shows that an algorithm for online linear optimisation can be used to
analyse the more general case. Now look again at Example 28.1 and notice
that online mirror descent with a quadratic potential and linearised losses is
really the same gradient descent we know and love. Online convex optimisation
is a rich topic by itself. We refer the interested reader to the books by Shalev-
Shwartz [2012] and Hazan [2016].
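The linearisation argument can be checked on a toy problem. The following is a minimal sketch, assuming the action set $[-1, 1]$, quadratic losses $f_t(a) = (a - z_t)^2$ with hand-picked targets $z_t$, and a quadratic potential; the names `online_gradient_descent`, `targets` and `comparator` are illustrative.

```python
def online_gradient_descent(targets, eta):
    """Linearised online convex optimisation on [-1, 1] with f_t(a) = (a - z_t)^2."""
    a, plays, grads = 0.0, [], []
    for z in targets:
        plays.append(a)
        y = 2.0 * (a - z)                      # y_t is the gradient of f_t at a_t
        grads.append(y)
        a = max(-1.0, min(1.0, a - eta * y))   # mirror descent step with F(a) = a^2 / 2
    return plays, grads


targets = [0.5, -0.25, 0.75, 0.5, 0.0, 0.5]
plays, grads = online_gradient_descent(targets, eta=0.25)
comparator = 0.5

# regret measured with the original convex losses
convex_regret = sum((a - z) ** 2 - (comparator - z) ** 2 for a, z in zip(plays, targets))
# regret measured with the linearised losses <a, y_t>
linear_regret = sum((a - comparator) * y for a, y in zip(plays, grads))
```

For these quadratic losses the gap between the two quantities is exactly $\sum_t (a_t - a)^2$, so the linearised regret always dominates the convex regret, which is the inequality used in the note.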

13 There is a nice application of online linear optimisation to minimax theorems.
Let $X$ and $Y$ be arbitrary sets. For any function $f : X \times Y \to \mathbb{R}$,

$$ \inf_{x \in X} \sup_{y \in Y} f(x, y) \ge \sup_{y \in Y} \inf_{x \in X} f(x, y) . $$

Under certain conditions, the inequality becomes an equality. Theorems
guaranteeing this are called minimax theorems. The following result by Sion
[1958] is one of the more generic variants. The statement uses notions of
quasi-convexity and semi-continuity, which are defined in the next note.


THEOREM 28.12 (Sion's minimax theorem). Suppose that $X$ and $Y$ are convex
subsets of linear topological spaces with at least one of $X$ or $Y$ compact. Let
$f : X \times Y \to \mathbb{R}$ be a function such that $f(\cdot, y)$ is lower semi-continuous and
quasi-convex for all $y \in Y$ and $f(x, \cdot)$ is upper semi-continuous and quasi-
concave for all $x \in X$. Then

$$ \inf_{x \in X} \sup_{y \in Y} f(x, y) = \sup_{y \in Y} \inf_{x \in X} f(x, y) . $$

There is a short topological proof of this theorem [Komiya, 1988]. You will
use the tools of online linear optimisation to analyse two special cases in
Exercise 28.16. When $X$ and $Y$ are probability simplexes and $f$ is linear, the
resulting theorem is von Neumann's minimax theorem [von Neumann, 1928].
The minimax theorems form a bridge between minimax adversarial regret and
Bayesian regret, which we discuss in Chapters 34 and 36.

14 Let $X$ be a subset of a linear topological space and $f : X \to \mathbb{R}$. The function $f$
is quasi-convex if $f^{-1}((-\infty, a))$ is convex for all $a \in \mathbb{R}$ and quasi-concave
if $-f$ is quasi-convex. $f$ is upper semi-continuous if for all $x \in X$ and
$\varepsilon > 0$ there exists a neighborhood $U$ of $x$ such that $f(y) \le f(x) + \varepsilon$ for all
$y \in U$. It is lower semi-continuous if for all $x \in X$ and $\varepsilon > 0$ there exists a
neighborhood $U$ of $x$ such that $f(y) \ge f(x) - \varepsilon$ for all $y \in U$.


Bibliographic Remarks 


The results in this chapter come from a wide variety of sources. The online convex
optimisation framework was popularised by Zinkevich [2003]. The framework
had been briefly considered by Warmuth and Jagota [1997], then reintroduced
by Gordon [1999] (without noticing the earlier work of Warmuth and Jagota).
While the framework was introduced relatively recently, the core ideas had
been worked out earlier in the special case of linear prediction with nonlinear
losses (the book of Cesa-Bianchi and Lugosi [2006] can be used as a reference
to this literature). Mirror descent was first developed by Nemirovsky [1979] and 
Nemirovsky and Yudin [1983] for classical optimisation. In statistical learning, 
follow-the-regularised-leader is known as regularised risk minimisation and 
has a long history. In the context of online learning, Gordon [1999] considered 
follow-the-regularised-leader and called it ‘generalised gradient descent’. The name 
seems to originate from the work of Shalev-Shwartz [2007] and Shalev-Shwartz and 
Singer [2007]. An implicit form of regularisation is to add a perturbation of the 
losses, leading to the ‘follow-the-perturbed-leader’ algorithm [Hannan, 1957, Kalai 
and Vempala, 2002], which is further explored in the context of combinatorial 
bandit problems in Chapter 30 (and see also Exercise 11.7). Readers interested in 
an overview of online learning will like the short books by Shalev-Shwartz [2012] 
and Hazan [2016], while the book by Cesa-Bianchi and Lugosi [2006] has a little 
more depth (but is also older). As far as we know, the first explicit application of 
mirror descent to bandits was by Abernethy et al. [2008]. Since then the idea has 
been used extensively, with some examples by Audibert et al. [2013], Abernethy 
et al. [2015], Bubeck et al. [2018] and Wei and Luo [2018]. Mirror descent has 
been adapted in a generic way to prove high-probability bounds by Abernethy 
and Rakhlin [2009]. The reader can find (slightly) different proofs of some mirror 
descent results in the book by Bubeck and Cesa-Bianchi [2012]. The results for 
the unit ball are from a paper by Bubeck et al. [2012], but we have reworked 
the proof to be more in line with the rest of the book. Mirror descent can be 
generalised to Banach spaces. For details, see the article by Sridharan and Tewari 
[2010]. 


Exercises 


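28.1 Let $F : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be Legendre with domain $\mathcal{D} \subseteq \mathbb{R}^d$ and $\mathcal{A} \subseteq \mathbb{R}^d$ be
convex, and for $b \in \operatorname{int}(\mathcal{D})$ and $y \in \mathbb{R}^d$ let $\Phi(a) = \langle a, y \rangle + D_F(a, b)$. Suppose that
$c \in \operatorname{argmin}_{a \in \mathcal{A}} \Phi(a)$ exists and $\mathcal{A} \cap \operatorname{int}(\mathcal{D}) \ne \emptyset$. Show that $c \in \operatorname{int}(\mathcal{D})$.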


28.2 (ILL-DEFINED ACTIONS) Give an example of a non-empty bounded convex
action set $\mathcal{A}$, convex potential $F$ and sequence of losses $(y_t)_{t=1}^n$ where the choices
of mirror descent and/or follow-the-regularised-leader either:


(a) exist but are not unique;
(b) do not exist at all.


Prove that if $F$ is Legendre and $\mathcal{A}$ is non-empty and compact, then the actions $(a_t)_t$ exist
and are unique for both mirror descent and follow-the-regularised-leader.


28.3 Prove the correctness of the two-step procedure described in Section 28.1.1. 


28.4 (LINEAR REGRET FOR FOLLOW-THE-LEADER) Let $\mathcal{A} = [-1, 1]$, and let
$y_1 = 1/2$ and $y_s = 1$ for odd $s > 1$ and $y_s = -1$ for even $s > 1$.

(a) Recall that follow-the-leader (without regularisation) chooses $a_t =
    \operatorname{argmin}_{a \in \mathcal{A}} \sum_{s=1}^{t-1} \langle a, y_s \rangle$. Show that this algorithm suffers linear regret on the
    above sequence.

(b) Implement follow-the-regularised-leader or mirror descent on this problem
    with quadratic potential $F(a) = a^2$ and plot $a_t$ as a function of time.
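As a starting point for part (b), the following is a minimal sketch rather than a full solution. It assumes ties in follow-the-leader are broken at $0$, uses the closed form $a_t = \operatorname{clip}(-\eta \sum_{s<t} y_s / 2)$ for the potential $F(a) = a^2$, and picks the learning rate $\eta = 1/\sqrt{n}$ purely for illustration; plotting is left out and all function names are hypothetical.

```python
def losses_exercise(n):
    """y_1 = 1/2, then y_s = 1 for odd s > 1 and y_s = -1 for even s > 1."""
    return [0.5 if s == 1 else (1.0 if s % 2 == 1 else -1.0) for s in range(1, n + 1)]


def play(losses, choose):
    """Regret on [-1, 1] when a_t = choose(y_1 + ... + y_{t-1})."""
    cum, incurred = 0.0, 0.0
    for y in losses:
        incurred += choose(cum) * y
        cum += y
    return incurred - (-abs(cum))   # the best fixed action has loss -|cum|


def ftl(cum):
    # follow-the-leader: minimise a * cum over [-1, 1]; ties broken at 0
    return 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)


def make_ftrl(eta):
    # minimiser of eta * a * cum + a^2 over [-1, 1]
    return lambda cum: max(-1.0, min(1.0, -eta * cum / 2.0))


n = 1000
ys = losses_exercise(n)
ftl_regret = play(ys, ftl)
ftrl_regret = play(ys, make_ftrl(eta=n ** -0.5))
```

On this sequence follow-the-leader switches between the two endpoints and pays a loss of one in every round after the first, while the regularised iterates stay close to zero.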


28.5 (REGRET FOR FOLLOW-THE-REGULARISED-LEADER) Prove Theorem 28.5. 
28.6 (REGRET IN TERMS OF LOCAL DUAL NORMS) Prove Corollary 28.8. 


28.7 (FOLLOW-THE-REGULARISED-LEADER FOR THE UNIT BALL) Prove the 
equality in Eq. (28.13). 


28.8 (EXPONENTIAL WEIGHTS AS MIRROR DESCENT) Prove the equality in 
Eq. (28.5). 


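28.9 (EXP3 AS MIRROR DESCENT) Let $\mathcal{A} = \mathcal{P}_{k-1}$ be the simplex, $F$ the
unnormalised negentropy potential and $\eta > 0$. Let $P_1 = \operatorname{argmin}_{p \in \mathcal{A}} F(p)$, and for
$t \ge 1$,

$$ P_{t+1} = \operatorname{argmin}_{p \in \mathcal{A}} \eta \langle p, \hat Y_t \rangle + D_F(p, P_t) , $$

where $\hat Y_{ti} = \mathbb{I}\{A_t = i\} y_{ti} / P_{ti}$ and $A_t$ is sampled from $P_t$.

(a) Show that the resulting algorithm is exactly Exp3 from Chapter 11.

(b) What happens if you replace mirror descent by follow-the-regularised-leader,

$$ P_{t+1} = \operatorname{argmin}_{p \in \mathcal{A}} \eta \sum_{s=1}^t \langle p, \hat Y_s \rangle + F(p) \, ? $$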


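28.10 (EXP3 AS MIRROR DESCENT (II)) Here you will show that the tools in
this chapter not only lead to the same algorithm, but also the same bounds.


(a) Let $\tilde P_{t+1} = \operatorname{argmin}_{p \in [0, \infty)^k} \eta \langle p, \hat Y_t \rangle + D_F(p, P_t)$. Show both relations in the
    following display:

$$ D_F(P_t, \tilde P_{t+1}) = \sum_{i=1}^k P_{ti} \left( \exp(-\eta \hat Y_{ti}) - 1 + \eta \hat Y_{ti} \right) \le \frac{\eta^2}{2} \sum_{i=1}^k P_{ti} \hat Y_{ti}^2 . $$

(b) Show that $\frac{1}{\eta} \mathbb{E}\left[ \sum_{t=1}^n D_F(P_t, \tilde P_{t+1}) \right] \le \frac{\eta n k}{2}$.
(c) Show that $\operatorname{diam}_F(\mathcal{P}_{k-1}) = \log(k)$.
(d) Conclude that for appropriately tuned $\eta > 0$, the regret of Exp3 satisfies

$$ R_n \le \sqrt{2 n k \log(k)} . $$


Hint Use Theorem 26.6(b).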


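28.11 (MIRROR DESCENT AND CHANGING LEARNING RATES) Let $\mathcal{A}$ be a
convex set and $y_1, \dots, y_n \in \mathcal{L} \subseteq \mathbb{R}^d$. Let $F$ be Legendre with domain $\mathcal{D}$ with
$\mathcal{A} \cap \operatorname{int}(\mathcal{D})$ non-empty and assume that Eq. (28.6) holds. Let $\eta_0, \eta_1, \dots, \eta_n > 0$,
$a_1, a_2, \dots, a_{n+1} \in \mathcal{A}$ and $\tilde a_2, \dots, \tilde a_{n+1}$ be sequences so that $\eta_0 = \infty$, $a_1 =
\operatorname{argmin}_{a \in \mathcal{A}} F(a)$ and

$$ \tilde a_{t+1} = \operatorname{argmin}_{a \in \mathcal{D}} \eta_t \langle a, y_t \rangle + D_F(a, a_t) , $$

$$ a_{t+1} = \operatorname{argmin}_{a \in \mathcal{A} \cap \mathcal{D}} D_F(a, \tilde a_{t+1}) . $$

Show that for all $a \in \mathcal{A}$,

(a) $$ R_n(a) = \sum_{t=1}^n \langle a_t - a, y_t \rangle \le \sum_{t=1}^n \frac{D_F(a_t, \tilde a_{t+1})}{\eta_t} + \sum_{t=1}^n \frac{D_F(a, a_t) - D_F(a, a_{t+1})}{\eta_t} , $$
and
(b) $$ R_n(a) \le \sum_{t=1}^n \frac{D_F(a_t, \tilde a_{t+1})}{\eta_t} + \sum_{t=1}^n D_F(a, a_t) \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) . $$


The statement allows the time-varying learning rate sequence $(\eta_t)_t$ to be
constructed in any way. This flexibility can be useful when designing adaptive
algorithms. A sequence of learning rates $(\eta_t)_t$ is said to be non-anticipating
if for each $t$, $\eta_t$ depends on data available at the end of round $t$.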


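28.12 (FOLLOW-THE-REGULARISED-LEADER AND CHANGING POTENTIALS) Like
in the previous exercise, let $\mathcal{A}$ be non-empty and convex and $y_1, \dots, y_n \in \mathcal{L} \subseteq \mathbb{R}^d$.
Let $F_1, \dots, F_n, F_{n+1}$ be a sequence of convex functions and $\Phi_t(a) = F_t(a) +
\sum_{s=1}^{t-1} \langle a, y_s \rangle$ and $a_t = \operatorname{argmin}_{a \in \mathcal{A}} \Phi_t(a)$, which you may assume are well defined.


(a) Show that

$$ \begin{aligned}
R_n(a) &\le \sum_{t=1}^n \left( \langle a_t - a_{t+1}, y_t \rangle - D_{F_t}(a_{t+1}, a_t) \right) \\
&\quad + F_{n+1}(a) - F_1(a_1) + \sum_{t=1}^n \left( F_t(a_{t+1}) - F_{t+1}(a_{t+1}) \right) .
\end{aligned} $$

(b) Show that if $F_t = F / \eta_t$ and $(\eta_t)_{t=1}^{n+1}$ is decreasing with $\eta_n = \eta_{n+1}$, then

$$ R_n(a) \le \frac{F(a) - \min_{b \in \mathcal{A}} F(b)}{\eta_n} + \sum_{t=1}^n \left( \langle a_t - a_{t+1}, y_t \rangle - \frac{D_F(a_{t+1}, a_t)}{\eta_t} \right) . $$


Again, the statement applies to any sequence of Legendre functions, including
those that are constructed based on the past.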


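28.13 (ANYTIME VERSION OF EXP3) Consider the $k$-armed adversarial bandit
problem described in Chapter 11, where the adversary chooses $(y_t)_{t=1}^\infty$ with
$y_t \in [0, 1]^k$. Let $P_t \in \mathcal{P}_{k-1}$ be defined by

$$ P_{ti} = \frac{\exp\left( -\eta_t \sum_{s=1}^{t-1} \hat Y_{si} \right)}{\sum_{j=1}^k \exp\left( -\eta_t \sum_{s=1}^{t-1} \hat Y_{sj} \right)} , $$

where $(\eta_t)_{t=1}^\infty$ is an infinite sequence of learning rates and $\hat Y_{ti} = \mathbb{I}\{A_t = i\} y_{ti} / P_{ti}$
and $A_t$ is sampled from $P_t$.


(a) Let $\mathcal{A} = \mathcal{P}_{k-1}$ be the simplex, $F$ be the unnormalised negentropy potential,
    $F_t(p) = F(p) / \eta_t$ and $\Phi_t(p) = F(p) / \eta_t + \sum_{s=1}^{t-1} \langle p, \hat Y_s \rangle$. Show that $P_t$ is the
    choice of follow-the-regularised-leader with potentials $(F_t)_{t=1}^\infty$ and losses
    $(\hat Y_t)_{t=1}^\infty$.

(b) Assume that $(\eta_t)_{t=1}^\infty$ are decreasing and then use Exercise 28.12 to show that

$$ R_n \le \frac{\log(k)}{\eta_n} + \mathbb{E}\left[ \sum_{t=1}^n \left( \langle P_t - P_{t+1}, \hat Y_t \rangle - \frac{D_F(P_{t+1}, P_t)}{\eta_t} \right) \right] . $$

(c) Use Theorem 26.13 in combination with the facts that $\hat Y_{ti} \ge 0$ for all $i$ and
    $\hat Y_{ti} = 0$ unless $A_t = i$ to show that

$$ \langle P_t - P_{t+1}, \hat Y_t \rangle - \frac{D_F(P_{t+1}, P_t)}{\eta_t} \le \frac{\eta_t}{2 P_{tA_t}} . $$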


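(d) Prove that $R_n \le \frac{\log(k)}{\eta_n} + \frac{k}{2} \sum_{t=1}^n \eta_t$.


(e) Choose $(\eta_t)_{t=1}^\infty$ so that $R_n \le 2 \sqrt{nk \log(k)}$ for all $n \ge 1$.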


28.14 (THE LOG BARRIER AND FIRST-ORDER BOUNDS) Your mission in this
exercise is to prove first-order bounds for finite-armed bandits as studied in
Chapter 11. The notation is the same as the previous exercise. Let $(y_t)_{t=1}^n$ be
a sequence of loss vectors with $y_t \in [0, 1]^k$ for all $t$ and $F(a) = -\sum_{i=1}^k \log(a_i)$.
Consider the instance of follow-the-regularised-leader for bandits that samples
$A_t$ from $P_t$ defined by

$$ P_t = \operatorname{argmin}_{p \in \mathcal{P}_{k-1}} \eta_t \sum_{s=1}^{t-1} \langle p, \hat Y_s \rangle + F(p) . $$

(a) Show a particular, non-anticipating choice of the learning rates $(\eta_t)_{t=1}^n$ so
    that

$$ R_n \le k + 2 \sqrt{ k \left( 1 + \mathbb{E}\left[ \sum_{t=1}^{n-1} y_{tA_t}^2 \right] \right) \log\left( \frac{n \vee k}{k} \right) } . \tag{28.15} $$


(b) Prove that any algorithm satisfying Eq. (28.15) also satisfies

$$ R_n \le k + C k \log\left( \frac{n \vee k}{k} \right) + C \sqrt{ k \left( 1 + \min_{a \in [k]} \sum_{t=1}^n y_{ta} \right) \log\left( \frac{n \vee k}{k} \right) } , $$

where $C$ is a suitably large universal constant.


Hint For choosing the learning rate, you might take inspiration from 
Theorem 18.3. 


The algorithm in this exercise is a simplified variant of the algorithm analysed 
by Wei and Luo [2018]. 


28.15 (MINIMAX REGRET FOR FINITE-ARMED ADVERSARIAL BANDITS) Let
$(y_t)_{t=1}^n$ be a sequence of loss vectors with $y_t \in [0,1]^k$ for all $t$ and
$F(a) = -2\sum_{i=1}^k \sqrt{a_i}$. Consider the instance of follow-the-regularised-leader for k-armed
adversarial bandits that samples $A_t \in [k]$ from $P_t$ defined by

$$P_t = \operatorname{argmin}_{p \in \mathcal{P}_{k-1}} \eta \sum_{s=1}^{t-1} \langle p, \hat Y_s\rangle + F(p),$$

where $\hat Y_{si} = \mathbb{I}\{A_s = i\}\, y_{si}/P_{si}$ is the importance-weighted estimator of $y_{si}$ and
$\eta > 0$ is the learning rate.


(a) Show that

$$P_{ti} = \left(\lambda + \eta \sum_{s=1}^{t-1} \hat Y_{si}\right)^{-2},$$

where $\lambda \in \mathbb{R}$ is the largest value such that $P_t \in \mathcal{P}_{k-1}$.

(b) Show that $P_{t+1,A_t} \le P_{tA_t}$ for all $t \in [n-1]$.

(c) Show that $\nabla^2 F(x) = \frac{1}{2}\operatorname{diag}(x^{-3/2})$.

(d) Show that $\operatorname{diam}_F(\mathcal{A}) \le 2\sqrt{k}$.

(e) Prove that the regret of this algorithm is bounded by $R_n \le \sqrt{8kn}$.

(f) What happens if you use mirror descent instead of follow-the-regularised-
leader? Are the resulting algorithms the same? And if not, what can you
prove for mirror descent?

(g) Explain how you would implement this algorithm. 

(h) Prove that if the learning rate is chosen in a time-dependent way to be $\eta_t = 1/\sqrt{t}$,
then the resulting instantiation of follow-the-regularised-leader satisfies
$R_n = O(\sqrt{nk})$ for adversarial bandits and $R_n = O(\sum_{i : \Delta_i > 0} \log(n)/\Delta_i)$ for
stochastic bandits with losses in [0, 1].


The algorithm in the above exercise is called the implicitly normalised 
forecaster (INF) and was introduced by Audibert and Bubeck [2009]. The 
last part of the exercise is very difficult. For ‘hints’, see the articles by 
Zimmert and Seldin [2019] and Zimmert et al. [2019]. 
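
One possible way to compute the distribution in Part (a) numerically is a bisection search for the normalising constant. The sketch below is an assumption-laden illustration, not the implementation of Audibert and Bubeck: the function name `inf_distribution` is hypothetical, and the search bracket $[1, \sqrt{k}]$ comes from shifting the scaled loss estimates so that the smallest is zero.

```python
import math


def inf_distribution(cum_loss_estimates, learning_rate, iterations=60):
    """Return P with P_i = (lam + eta * L_i)^(-2) and sum_i P_i = 1.

    cum_loss_estimates holds L_i = sum_{s<t} Yhat_si for each arm i.
    """
    scaled = [learning_rate * x for x in cum_loss_estimates]
    low = min(scaled)
    shifted = [x - low for x in scaled]  # all >= 0 and at least one equals 0
    k = len(shifted)

    def total(mu):
        # mu plays the role of lam + low; total is decreasing in mu for mu > 0
        return sum((mu + x) ** -2 for x in shifted)

    # total(1) >= 1 because the best arm contributes 1; total(sqrt(k)) <= 1
    lo, hi = 1.0, math.sqrt(k)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    probs = [(mu + x) ** -2 for x in shifted]
    norm = sum(probs)
    return [p / norm for p in probs]  # removes the residual bisection error
```

Searching on the branch where every term $\lambda + \eta \hat L_i$ is positive corresponds to taking the largest admissible $\lambda$. Newton's method would also work here and typically needs fewer iterations.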
28.16 (MINIMAX THEOREM) In this exercise you will prove simplified versions
of Sion's minimax theorem.


(a) Use the tools from online linear optimisation to prove Sion's minimax theorem
when $X = \mathcal{P}_{k-1}$ and $Y = \mathcal{P}_{j-1}$ and $f(x,y) = x^\top G y$ for some $G \in \mathbb{R}^{k \times j}$.

(b) Generalise your result to the case when $X$ and $Y$ are non-empty, convex,
compact subsets of $\mathbb{R}^d$ and $f : X \times Y \to \mathbb{R}$ is convex/concave and has
bounded gradients.


HINT Consider a repeated simultaneous game where the first player chooses
$(x_t)_{t=1}^\infty$ and the second player chooses $(y_t)_{t=1}^\infty$. The loss in round $t$ to the first
player is $f(x_t, y_t)$, and the loss to the second player is $-f(x_t, y_t)$. See what
happens to the average iterates $\bar x_n = \frac{1}{n}\sum_{t=1}^n x_t$ and $\bar y_n = \frac{1}{n}\sum_{t=1}^n y_t$ when $(x_t)$
and $(y_t)$ are chosen by (appropriate) regret-minimising algorithms. For the second
part, see Note 12. Also observe that there is nothing fundamental about X and
Y both having dimension d.
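
The self-play idea in the hint can be tried numerically for the bilinear case of Part (a). The following Python sketch assumes both players run exponential weights with full information and a fixed learning rate tuned to a known horizon; the function names and the example matrix are hypothetical, and the code is an experiment rather than a proof.

```python
import math


def hedge_weights(cum_losses, eta):
    """Exponential-weights distribution for the given cumulative losses."""
    low = min(cum_losses)
    w = [math.exp(-eta * (c - low)) for c in cum_losses]
    s = sum(w)
    return [v / s for v in w]


def self_play(G, n):
    """Both players run Hedge on f(x, y) = x^T G y; returns (x_bar, y_bar).

    Assumes G is not constant, so that its range (width) is positive.
    """
    k, j = len(G), len(G[0])
    width = max(map(max, G)) - min(map(min, G))
    eta_x = math.sqrt(8 * math.log(k) / n) / width
    eta_y = math.sqrt(8 * math.log(j) / n) / width
    cum_x, cum_y = [0.0] * k, [0.0] * j
    x_bar, y_bar = [0.0] * k, [0.0] * j
    for _ in range(n):
        x = hedge_weights(cum_x, eta_x)
        y = hedge_weights(cum_y, eta_y)
        # The first player's loss vector is G y; the second player's is -G^T x.
        for a in range(k):
            cum_x[a] += sum(G[a][b] * y[b] for b in range(j))
        for b in range(j):
            cum_y[b] -= sum(G[a][b] * x[a] for a in range(k))
        x_bar = [m + v / n for m, v in zip(x_bar, x)]
        y_bar = [m + v / n for m, v in zip(y_bar, y)]
    return x_bar, y_bar


def duality_gap(G, x_bar, y_bar):
    """max_y f(x_bar, y) - min_x f(x, y_bar), which is always non-negative."""
    k, j = len(G), len(G[0])
    upper = max(sum(x_bar[a] * G[a][b] for a in range(k)) for b in range(j))
    lower = min(sum(G[a][b] * y_bar[b] for b in range(j)) for a in range(k))
    return upper - lower
```

The gap returned by `duality_gap` is at most the sum of the two players' average regrets, so it shrinks as n grows, which is the observation the hint asks you to turn into a proof.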


28.17 (COUNTEREXAMPLE TO SION WITHOUT COMPACTNESS) Find examples
of X, Y and f that satisfy the conditions of Sion's theorem except that neither
X nor Y is compact and where the statement does not hold. Can you choose f
to be bounded?
The Relation between Adversarial
and Stochastic Linear Bandits


The purpose of this chapter is to highlight some of the differences and connections
between adversarial and stochastic linear bandits. As it turns out, the connection
between these is not as straightforward as for finite-armed bandits. We focus
on three topics:


(a) For fixed action sets, there is a reduction from stochastic linear bandits to
adversarial linear bandits. This does not come entirely for free. The action
set needs to be augmented for things to work (Section 29.2).

(b) The adversarial and stochastic settings make different assumptions about
the variability of the losses/rewards. This will explain the apparently
contradictory result that the upper bound for adversarial bandits on the unit
ball is $O(\sqrt{dn\log(n)})$ (Theorem 28.11), while the lower bound for stochastic
bandits also for the unit ball is $\Omega(d\sqrt{n})$ (Theorem 24.2).

(c) When the action set is changing, the notion of regret in the adversarial
setting must be carefully chosen, and for the 'right' choice, we do not yet
have effective algorithms (Section 29.4).


We start with a unified view of the two settings.


29.1 Unified View


Figure 29.1 A tricky relationship


To make the notation consistent, we present the stochastic and adversarial
linear bandit frameworks again using losses for both. Let $\mathcal{A} \subset \mathbb{R}^d$ be the action
set. In each round, the learner chooses $A_t \in \mathcal{A}$ and receives the loss $Y_t$, where

$$Y_t = \langle A_t, \theta\rangle + \eta_t, \qquad \text{(Stochastic setting)} \qquad (29.1)$$
$$Y_t = \langle A_t, \theta_t\rangle, \qquad \text{(Adversarial setting)} \qquad (29.2)$$

and $(\eta_t)_{t=1}^n$ is a sequence of independent and identically
distributed 1-subgaussian random variables and $(\theta_t)_{t=1}^n$ is a sequence of loss
vectors chosen by the adversary. As noted earlier, the assumptions on the noise
can be relaxed significantly. For example, if $\mathcal{F}_t = \sigma(A_1, Y_1, \ldots, A_t, Y_t, A_{t+1})$, then
the results of the previous chapters hold as soon as $\eta_t$ is 1-subgaussian conditioned
on $\mathcal{F}_{t-1}$. The expected regret for the two cases is defined as follows:

$$R_n = \mathbb{E}\left[\sum_{t=1}^n \langle A_t, \theta\rangle\right] - n \inf_{a \in \mathcal{A}} \langle a, \theta\rangle, \qquad \text{(Stochastic setting)}$$
$$R_n = \mathbb{E}\left[\sum_{t=1}^n \langle A_t, \theta_t\rangle\right] - n \inf_{a \in \mathcal{A}} \langle a, \bar\theta_n\rangle. \qquad \text{(Adversarial setting)}$$

In the last display, $\bar\theta_n = \frac{1}{n}\sum_{t=1}^n \theta_t$ is the average of the loss vectors chosen by
the adversary.


29.2 Reducing Stochastic Linear Bandits to Adversarial Linear 
Bandits 


To formalise the intuition that adversarial environments are harder than stochastic 
environments, one may try to find a reduction where learning in the stochastic 
setting is reduced to learning in the adversarial setting. Here, reducing problem 
E (‘easy’) to problem H (‘hard’) just means that we can use algorithms designed 
for problem H to solve instances of problem E. In order to do this, we need to 
transform instances of problem E into instances of problem H and translate back 
the actions of algorithms designed for H to actions for problem E. To get a regret 
bound for problem E from a regret bound for problem H, one needs to ensure 
that the losses translate properly between the problem classes. 

Of course, based on our previous discussion, we know that if there is a reduction
from stochastic linear bandits to adversarial linear bandits, then somehow the
adversarial problem must change so that no contradiction is created in the curious
case of the unit ball. To be able to use an adversarial algorithm in the stochastic
environment, we need to specify a sequence $(\theta_t)_t$ so that the adversarial feedback
matches the stochastic one. Comparing Eq. (29.1) and Eq. (29.2), we can see
that the crux of the problem is incorporating the noise $\eta_t$ into $\theta_t$ while satisfying
the other requirements. One simple way of doing this is by introducing an extra
dimension for the adversarial problem.

In particular, suppose that the stochastic problem is d-dimensional so that
$\mathcal{A} \subset \mathbb{R}^d$. For the sake of simplicity, assume furthermore that the noise and
parameter vector satisfy $|\langle a, \theta\rangle + \eta_t| \le 1$ almost surely for all $a \in \mathcal{A}$ and that
$a_* = \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \theta\rangle$ exists. Then define $\mathcal{A}_{\text{aug}} = \{(a, 1) : a \in \mathcal{A}\} \subset \mathbb{R}^{d+1}$ and let
the adversary choose $\theta_t = (\theta, \eta_t) \in \mathbb{R}^{d+1}$. Here, we slightly abuse notation: for
$x \in \mathbb{R}^d$ and $y \in \mathbb{R}$, we use $(x, y)$ to denote the $(d+1)$-dimensional vector whose
first d components are those of x and whose last component is y. The reduction
is now straightforward: for t = 1, 2, ..., do the following:


1 Initialise adversarial bandit policy with action set $\mathcal{A}_{\text{aug}}$.
2 Collect action $A'_t = (A_t, 1)$ from the policy.
3 Play $A_t$ and observe loss $Y_t$.
4 Feed $Y_t$ to the adversarial bandit policy, increment t and repeat from step 2.
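
The four steps can be sketched in Python for a finite action set. Everything below is illustrative: plain Exp3 over the augmented actions stands in for 'an adversarial bandit policy' (it ignores the linear structure), the losses are rescaled from [-1, 1] to [0, 1] before being fed to it, and the names `Exp3`, `augment` and `run_reduction` are hypothetical.

```python
import math
import random


class Exp3:
    """Plain Exp3 over a finite action list, for losses in [0, 1]."""

    def __init__(self, actions, learning_rate, rng):
        self.actions = list(actions)
        self.eta = learning_rate
        self.rng = rng
        self.loss_estimates = [0.0] * len(self.actions)
        self.last_index = 0
        self.last_prob = 1.0

    def act(self):
        low = min(self.loss_estimates)
        weights = [math.exp(-self.eta * (x - low)) for x in self.loss_estimates]
        total = sum(weights)
        probs = [w / total for w in weights]
        self.last_index = self.rng.choices(range(len(probs)), weights=probs)[0]
        self.last_prob = probs[self.last_index]
        return self.actions[self.last_index]

    def update(self, loss):
        # Importance-weighted loss estimate for the action just played.
        self.loss_estimates[self.last_index] += loss / self.last_prob


def augment(action):
    """Map a in R^d to (a, 1) in R^(d+1)."""
    return tuple(action) + (1.0,)


def run_reduction(actions, theta, noise, n, rng):
    """Run steps 1-4 and return sum_t <A_t - a_*, theta>."""
    def mean_loss(a):
        return sum(x * w for x, w in zip(a, theta))

    aug_actions = [augment(a) for a in actions]
    k = len(aug_actions)
    policy = Exp3(aug_actions, math.sqrt(2 * math.log(k) / (n * k)), rng)  # step 1
    best = min(mean_loss(a) for a in actions)
    regret = 0.0
    for _ in range(n):
        aug_action = policy.act()              # step 2: A'_t = (A_t, 1)
        action = aug_action[:-1]               # recover A_t
        loss = mean_loss(action) + noise(rng)  # step 3: Y_t, assumed in [-1, 1]
        policy.update((loss + 1.0) / 2.0)      # step 4, rescaled to [0, 1]
        regret += mean_loss(action) - best
    return regret
```

The policy never sees $\theta$ or the noise separately; it only receives $Y_t$, which is exactly the inner product of the augmented action with $(\theta, \eta_t)$.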


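Suppose the adversarial policy guarantees a bound $B_n$ on the expected regret:

$$R'_n = \mathbb{E}\left[\max_{a' \in \mathcal{A}_{\text{aug}}} \sum_{t=1}^n \langle A'_t - a', \theta_t\rangle\right] \le B_n.$$

Let $a'_* = (a_*, 1)$. Note that for any $a' = (a, 1) \in \mathcal{A}_{\text{aug}}$, $\langle A_t, \theta\rangle - \langle a, \theta\rangle =
\langle A'_t, \theta_t\rangle - \langle a', \theta_t\rangle$ and thus adversarial regret, and eventually $B_n$, will upper
bound the stochastic regret:

$$\mathbb{E}\left[\sum_{t=1}^n \langle A_t, \theta\rangle\right] - n\langle a_*, \theta\rangle = \mathbb{E}\left[\sum_{t=1}^n \langle A'_t, \theta_t\rangle - n\langle a'_*, \bar\theta_n\rangle\right] \le R'_n \le B_n.$$

Therefore, the expected regret in the stochastic bandit is also at most $B_n$. We
have to emphasise that this reduction changes the geometry of the decision sets
for both the learner and the adversary. For example, if $\mathcal{A} = B_2^d$ is the unit ball,
then neither $\mathcal{A}_{\text{aug}}$ nor

$$\left\{ y \in \mathbb{R}^{d+1} : \sup_{a \in \mathcal{A}_{\text{aug}}} |\langle a, y\rangle| \le 1 \right\}$$

is a unit ball. It does not seem like this should make much difference, but at
least in the case of the ball, from our $\Omega(d\sqrt{n})$ lower bound on the regret for the
stochastic case, we see that the changed geometry must make the adversary more
powerful. This reinforces the importance of the geometry of the action set, which
we have already seen in the previous chapter.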

While the reduction shows one way to use adversarial algorithms in stochastic 
environments, the story seems to be unfinished. When facing a linear bandit 
problem with some action set A, the user is forced to decide whether or not 
the environment is stochastic. Strangely enough, for stochastic environments the 
recommendation is to run your favorite adversarial linear bandit algorithm on the 
augmented action set. What if the environment may or may not be stochastic? 
One can still run the adversarial linear bandit algorithm on the original action 
set. This usually works, but the algorithm may need to be tuned differently 
(Exercises 29.2 and 29.3). 


Stochastic Linear Bandits with Parameter Noise 


The real reason for all these discrepancies is that the adversarial linear bandit
model is better viewed as a relaxation of another class of stochastic linear bandits.
Rather than assuming the noise is added after taking an inner product, assume that
$(\theta_t)_{t=1}^n$ is a sequence of vectors sampled independently from a fixed distribution
$\nu$ on $\mathbb{R}^d$. The resulting model is called a stochastic linear bandit with
parameter noise. This new problem can be trivially reduced to adversarial
bandits when $\operatorname{Supp}(\nu)$ is bounded (Exercise 29.1). In particular, there is no need
to change the action set.


Combining the stochastic linear bandits with parameter noise model with the 
techniques in Chapter 24 is the standard method for proving lower bounds 
for adversarial linear bandits. 


Parameter noise environments form a subset of all possible stochastic
environments. To see this, let $\theta = \int x \, \nu(dx)$ be the mean parameter vector
under $\nu$. Then the loss in round t is

$$\langle A_t, \theta_t\rangle = \langle A_t, \theta\rangle + \langle A_t, \theta_t - \theta\rangle.$$

Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_{t-1}]$. By our assumption that $\nu$ has mean $\theta$, the second term
vanishes in expectation, $\mathbb{E}_t[\langle A_t, \theta_t - \theta\rangle] = 0$. This implies that we can make a
connection to the 'vanilla' stochastic setting by letting $\eta_t = \langle A_t, \theta_t - \theta\rangle$. Now
consider the conditional variance of $\eta_t$:

$$\mathbb{V}_t[\eta_t] = \mathbb{E}_t[\langle A_t, \theta_t - \theta\rangle^2] = A_t^\top \mathbb{E}_t[(\theta_t - \theta)(\theta_t - \theta)^\top] A_t = A_t^\top \Sigma A_t, \qquad (29.3)$$

where $\Sigma$ is the covariance matrix of the multivariate distribution $\nu$. Eq. (29.3) implies
that the variance of the noise $\eta_t$ now depends on the choice of action and in
particular the noise variance scales with the length of $A_t$. This can make parameter
noise problems easier. For example, if $\nu$ is a Gaussian with identity covariance,
then $\mathbb{V}_t[\eta_t] = \|A_t\|_2^2$ so that long actions have more noise than short actions.
By contrast, in the usual stochastic linear bandit, the variance of the noise is 
unrelated to the length of the action. In particular, even the noise accompanying 
short actions can be large. This makes quite a bit of difference in cases when 
the action set has both short and long actions. In the standard stochastic model, 
shorter actions have the disadvantage of having a worse signal-to-noise ratio, 
which an adversary can exploit. 
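
The scaling of the noise variance with the length of the action is easy to check by simulation. The sketch below assumes $\nu$ is Gaussian with identity covariance, as in the example above; the function name and the test vectors are hypothetical.

```python
import random


def empirical_noise_variance(action, theta, num_samples, rng):
    """Estimate V[<a, theta_t - theta>] when theta_t ~ N(theta, I)."""
    total = 0.0
    for _ in range(num_samples):
        theta_t = [m + rng.gauss(0.0, 1.0) for m in theta]
        noise = sum(a * (x - m) for a, x, m in zip(action, theta_t, theta))
        total += noise * noise  # the noise has mean zero, so this is the variance
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.3, -0.2, 0.5]
    for action in ([0.1, 0.0, 0.0], [1.0, 2.0, 2.0]):
        squared_length = sum(a * a for a in action)
        estimate = empirical_noise_variance(action, theta, 20000, rng)
        print(action, squared_length, estimate)
```

For each action the estimate should be close to the squared Euclidean norm, so the short action sees far less noise than the long one.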

This calculation also provides the reason for the different guarantees for the
unit ball. For stochastic linear bandits with 1-subgaussian noise the regret is
$\tilde O(d\sqrt{n})$, while in the last chapter we showed that for adversarial linear bandits,
the regret is $\tilde O(\sqrt{dn})$. This discrepancy is explained by the variance of the noise.
Suppose that $\nu$ is supported on the unit sphere. Then the eigenvalues of its
covariance matrix sum to one and if the learner chooses $A_t$ from the uniform
probability measure $\mu$ on the sphere, then

$$\mathbb{E}[\mathbb{V}_t[\eta_t]] = \int a^\top \Sigma a \, d\mu(a) = 1/d.$$

By contrast, in the standard stochastic model with 1-subgaussian noise, the
predictable variation of the noise is just 1. If the adversary were allowed to choose
its loss vectors from the sphere of radius $\sqrt{d}$, then the expected predictable
variation would be 1, matching the standard stochastic case, and the regret would
scale linearly in d, which also matches the vanilla stochastic case. This example
further emphasises the importance of the assumptions that restrict the choices of
the adversary.
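
The value 1/d in the calculation above can be verified by a short symmetry argument. The derivation below is a sketch under the extra assumption that $\nu$ has mean zero, so that $\operatorname{tr}(\Sigma) = \mathbb{E}[\|\theta_t\|_2^2] = 1$; for a non-zero mean the trace is $1 - \|\theta\|_2^2$ and the same steps give an upper bound of 1/d. By rotation invariance the second-moment matrix of $\mu$ is a multiple of the identity, and its trace is $\int \|a\|_2^2 \, d\mu(a) = 1$.

```latex
\int a a^\top \, d\mu(a) = \frac{1}{d} I
\quad \Longrightarrow \quad
\int a^\top \Sigma a \, d\mu(a)
  = \operatorname{tr}\left( \Sigma \int a a^\top \, d\mu(a) \right)
  = \frac{\operatorname{tr}(\Sigma)}{d}
  = \frac{1}{d}.
```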


The best way to think about the standard adversarial linear model is that it 
generalises the stochastic linear bandit with parameter noise. Linear bandits 
with parameter noise are sometimes easier than the standard model because 
parameter noise limits the adversary’s control of the signal-to-noise ratio 
experienced by the learner. 


Contextual Linear Bandits 


In practical applications the action set is usually changing from round to round.
Although it is possible to prove bounds for adversarial linear bandits with changing
action sets, the notion of regret makes the results less meaningful than what
one obtains in the stochastic setting. Suppose that $(\mathcal{A}_t)_{t=1}^n$ is a sequence of
action sets. In the stochastic setting, the actions $(A_t)_t$ selected by the LinUCB
algorithm satisfy

$$\mathbb{E}\left[\sum_{t=1}^n \langle A_t - a_t^*, \theta\rangle\right] = \tilde O(d\sqrt{n}),$$

where $a_t^* = \operatorname{argmin}_{a \in \mathcal{A}_t} \langle a, \theta\rangle$ is the optimal action in round t (recall that we
are working with losses). This definition of
the regret measures the right thing: the action $a_t^*$ really is the optimal action in
round t. The analogous result for adversarial bandits would be a bound on

$$R_n(\Theta) = \max_{\theta \in \Theta} \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a_t^*(\theta), \theta\rangle\right], \qquad (29.4)$$

where $\Theta$ is a subset of $\mathbb{R}^d$ and $a_t^*(\theta) = \operatorname{argmin}_{a \in \mathcal{A}_t} \langle a, \theta\rangle$. Unfortunately, however,
we do not currently know how to design algorithms for which this regret is small.
For finite $\Theta$, the techniques of Chapter 27 are easily adapted to prove a bound of
$O(\sqrt{dn\log|\Theta|})$, but this approach has two drawbacks: (a) the algorithm is not
computationally efficient for large $|\Theta|$, and (b) choosing $\Theta$ as an $\varepsilon$-covering of a
continuous set does not guarantee a bound against the larger set. Providing a
meaningful bound on Eq. (29.4) when $\Theta$ is a continuous set like $\{\theta : \|\theta\|_2 \le 1\}$ is a
fascinating challenge. The reader
may recall that the result in Exercise 27.5 provides a bound for adversarial linear
bandits with changing action sets. However, in this problem the actions have
'identities', and the regret is measured with respect to the best action in hindsight,
which is a markedly different objective than the one in Eq. (29.4).
Notes


1 For the reduction in Section 29.2, we assumed that $|Y_t| \le 1$ almost surely.
This is not true for many classical noise models like the Gaussian. One way to
overcome this annoyance is to apply the adversarial analysis on the event that
$|Y_t| \le C$ for some constant $C > 0$ that is sufficiently large that the probability
that this event occurs is high. For example, if $\eta_t$ is a standard Gaussian and
$\sup_{a \in \mathcal{A}} |\langle a, \theta\rangle| \le 1$, then C may be chosen to be $1 + \sqrt{4\log(n)}$, and the failure
event that there exists a t such that $|\langle A_t, \theta\rangle + \eta_t| \ge C$ has probability at most
1/n by Theorem 5.3 and a union bound.

2 The mirror descent analysis of adversarial linear bandits also works for
stochastic bandits. Recall that mirror descent samples $A_t$ from a distribution
with a conditional mean of $\bar A_t$, and suppose that $\hat\theta_t$ is a conditionally unbiased
estimator of $\theta$. Then the regret for a stochastic linear bandit with optimal
action $a^*$ can be rewritten as

$$R_n = \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a^*, \theta\rangle\right] = \mathbb{E}\left[\sum_{t=1}^n \langle \bar A_t - a^*, \theta\rangle\right] = \mathbb{E}\left[\sum_{t=1}^n \langle \bar A_t - a^*, \hat\theta_t\rangle\right],$$

which is in the standard format necessary for the analysis of mirror
descent/follow-the-regularised-leader. In the stochastic setting, the covariance
of the least squares estimator $\hat\theta_t$ will not be the same as in the adversarial
setting, however, which leads to different results. When $\hat\theta_t$ is biased, the bias
term can be incorporated into the above formula and then bounded separately.
Consider a stochastic bandit with $\mathcal{A} = B_2^d$ the unit ball and $Y_t = \langle A_t, \theta\rangle + \eta_t$
where $|Y_t| \le 1$ almost surely and $\|\theta\|_2 \le 1$. Adapting the analysis of the
algorithm in Section 28.4 leads to a bound of $R_n = O(d\sqrt{n\log(n)})$. Essentially
the only change is the variance calculation, which increases by roughly a factor
of d. The details of this calculation are left to you in Exercise 29.2. When $\mathcal{A}$ is
finite, the analysis of Exp3 with Kiefer-Wolfowitz exploration (Theorem 27.1)
leads to an algorithm for which $R_n = O(\sqrt{dn\log(k)})$. For convex $\mathcal{A}$, you can
use continuous exponential weights (Section 27.3).


3 You might wonder whether or not an adversarial bandit algorithm is well
behaved for stochastic bandits where the model is almost linear (the misspecified
linear bandit). Suppose the loss is nearly linear in the sense that

$$Y_t = \ell(A_t) + \eta_t,$$

where $\ell(A_t) = \langle A_t, \theta\rangle + \varepsilon(A_t)$ and $\varepsilon : \mathcal{A} \to \mathbb{R}$ is some function with small
supremum norm. Because $\varepsilon(A_t)$ depends on the chosen action, it is not possible
to write $Y_t = \langle A_t, \theta_t\rangle$ for $\theta_t$ independent of $A_t$. When $\mathcal{A} = B_2^d$ is the unit
ball, you will show in Exercise 29.4 that an appropriately tuned instantiation
of follow-the-regularised-leader satisfies $R_n = O(d\sqrt{n\log(n)} + \varepsilon n\sqrt{d})$, where
$\varepsilon = \sup_{a \in \mathcal{A}} \varepsilon(a)$. This improves by logarithmic factors on the more generic
algorithm in Chapter 22.
Bibliographic Remarks


Linear bandits on the sphere with parameter noise have been studied by Carpentier
and Munos [2012]. However, they consider the case where the action set is the
sphere and the components of the noise are independent so that the reward is
$X_t = \langle A_t, \theta + \eta_t\rangle$ where the coordinates of $\eta_t \in \mathbb{R}^d$ are independent with unit
variance. In this case, the predictable variation is $\mathbb{V}[X_t \mid A_t] = \sum_{i=1}^d A_{ti}^2 = 1$ for
all actions $A_t$ and the parameter noise is equivalent to the standard model. We
are not aware of any systematic studies of parameter noise in the stochastic
setting. With only a few exceptions, the impact on the regret of the action set
and adversary's choices is not well understood beyond the case where $\mathcal{A}$ is an
$\ell_p$-ball, which has been mentioned in the previous section. A variety of lower
bounds illustrating the complications are given by Shamir [2015]. Perhaps the
most informative is the observation that obtaining $O(\sqrt{dn})$ regret is not possible
when $\mathcal{A} = \{a + x : \|x\|_2 \le 1\}$ is a shifted unit ball with $a = (2, 0, \ldots, 0)$, which
also follows from our reduction in Section 29.2.


Exercises 


29.1 (REDUCTIONS) Let $\mathcal{A} \subset \mathbb{R}^d$ be an action set and $\mathcal{L} = \{y \in \mathbb{R}^d :
\sup_{a \in \mathcal{A}} |\langle a, y\rangle| \le 1\}$. Take an adversarial linear bandit algorithm that enjoys
a worst-case guarantee $B_n$ on its n-round expected regret $R_n$ when the adversary
is restricted to playing $\theta_t \in \mathcal{L}$. Show that if this algorithm is used in a stochastic
linear bandit problem with parameter noise where $\theta_t \sim \nu$ and $\operatorname{Supp}(\nu) \subseteq \mathcal{L}$, then
the expected regret $R'_n$ is still bounded by $B_n$.


29.2 (FOLLOW-THE-REGULARISED-LEADER FOR STOCHASTIC BANDITS (I))
Consider a stochastic linear bandit with $\mathcal{A} = B_2^d$ and loss $Y_t = \langle A_t, \theta\rangle + \eta_t$ where
$(\eta_t)_{t=1}^n$ are independent with zero mean and $Y_t \in [-1, 1]$ almost surely. Adapt the
proof of Theorem 28.11 to show that with appropriate tuning the algorithm in
Section 28.4 satisfies $R_n \le C d\sqrt{n\log(n)}$ for a universal constant $C > 0$.


HINT Repeat the analysis in the proof of Theorem 28.11, update the learning 
rate and check the bounds on the norm of the estimators. 


29.3 (FOLLOW-THE-REGULARISED-LEADER FOR STOCHASTIC BANDITS (II))
Repeat the previous exercise using exponential weights or continuous exponential 
weights with Kiefer-Wolfowitz exploration where 


(a) A is finite; and 


(b) A is convex. 


29.4 (MISSPECIFIED LINEAR BANDITS) Let $\mathcal{A} \subset \mathbb{R}^d$ and $(\eta_t)_{t=1}^n$ be a sequence
of independent zero-mean random variables and assume the loss is

$$Y_t = \ell(A_t) + \eta_t,$$

where $\ell(A_t) = \langle A_t, \theta\rangle + \varepsilon(A_t)$ and $\varepsilon = \sup_{a \in \mathcal{A}} \varepsilon(a)$ and $|Y_t| \le 1$ almost surely.


(a) Suppose that $\mathcal{A} = B_2^d$. Show that the expected regret of an appropriately
tuned version of the algorithm in Section 28.4 satisfies

$$R_n \le C\left(d\sqrt{n\log(n)} + \varepsilon n\sqrt{d}\right),$$


where C > 0 is a universal constant. 

(b) Do you think the result from Part (a) can be improved? 

(c) Suppose that A is finite. What goes wrong in the analysis of exponential 
weights with Kiefer-Wolfowitz exploration (Algorithm 15)? 


Part VII 
Other Topics 
In the penultimate part, we collect a few topics to which we could not dedicate
a whole part. When deciding what to include, we balanced our subjective views 
on what is important, pedagogical and sufficiently well understood for a book. 
Of course we have played favourites with our choices and hope the reader can 
forgive us for the omissions. We spend the rest of this intro outlining some of the 
omitted topics. 


Continuous-Armed Bandits 

There is a small literature on bandits where the number of actions is infinitely
large. We covered the linear case in earlier chapters, but the linear assumption
can be relaxed significantly. Let $\mathcal{A}$ be an arbitrary set and $\mathcal{F}$ a set of functions
from $\mathcal{A} \to \mathbb{R}$. The learner is given access to the action set $\mathcal{A}$ and function class
$\mathcal{F}$. In each round, the learner chooses an action $A_t \in \mathcal{A}$ and receives reward
$X_t = f(A_t) + \eta_t$, where $\eta_t$ is noise and $f \in \mathcal{F}$ is fixed, but unknown. Of course
this set-up is general enough to model all of the stochastic bandits so far, but is
perhaps too general to say much. One interesting relaxation is the case where $\mathcal{A}$
is a metric space and $\mathcal{F}$ is the set of Lipschitz functions. We refer the reader to
papers by Kleinberg [2005], Auer et al. [2007], Kleinberg et al. [2008], Bubeck
et al. [2011], Slivkins [2014], Magureanu et al. [2014] and Combes et al. [2017], as
well as the book of Slivkins [2019].


Infinite-Armed Bandits 

Consider a bandit problem where in each round the learner can choose to play
an arm from an existing pool of Bernoulli arms or to add another Bernoulli arm
to the pool with mean sampled from a uniform distribution. The regret in this
setting is defined as

$$R_n = n - \mathbb{E}\left[\sum_{t=1}^n X_t\right].$$

This problem is studied by Berry et al. [1997], who show that $R_n = O(n^{1/2})$ is the
optimal regret. There are now a number of strengthenings and generalisations of
this work [Wang et al., 2009, Bonald and Proutiere, 2013, Carpentier and Valko,
2015, for example], which sadly must be omitted from this book. The notable
difficulty is generalising the algorithms and analysis to the case where the reservoir
distribution from which the new arms are sampled is unknown and/or does not
exhibit a nice structure.


Duelling Bandits 

In the duelling bandit problem, the learner chooses two arms in each round
$A_{t1}, A_{t2}$. Rather than observing a reward for each arm, the learner observes
the winner of a 'duel' between the two arms. Let k be the number of arms and
$P \in [0,1]^{k \times k}$ be a matrix where $P_{ij}$ is the probability that arm i beats arm j in
a duel. It is natural to assume that $P_{ij} = 1 - P_{ji}$. A common, but slightly less
justifiable, assumption is the existence of a total ordering on the arms such that
if $i \succ j$, then $P_{ij} > 1/2$. There are at least two notions of regret. Let $i^*$ be the
optimal arm so that $i^* \succ j$ for all $j \ne i^*$. Then the strong and weak regret are
defined by

$$\text{Strong regret} = \mathbb{E}\left[\sum_{t=1}^n \left(P_{i^* A_{t1}} + P_{i^* A_{t2}} - 1\right)\right],$$

$$\text{Weak regret} = \mathbb{E}\left[\sum_{t=1}^n \min\left\{P_{i^* A_{t1}} - 1/2,\; P_{i^* A_{t2}} - 1/2\right\}\right].$$

Both definitions measure the number of times arms with low probability of
winning a duel against the optimal arm are played. The former definition only
vanishes when $A_{t1} = A_{t2} = i^*$, while the latter is zero as soon as $i^* \in \{A_{t1}, A_{t2}\}$.
The duelling bandit problem was introduced by Yue et al. [2009] and has seen
quite a lot of interest since then [Yue and Joachims, 2009, 2011, Ailon et al.,
2014, Zoghi et al., 2014, Dudik et al., 2015, Jamieson et al., 2015, Komiyama
et al., 2015a, Zoghi et al., 2015, Wu and Liu, 2016, Zimmert and Seldin, 2019].


Convex Bandits 

Let $\mathcal{A} \subset \mathbb{R}^d$ be a convex set. The convex bandit problem comes in both stochastic
and adversarial varieties. In both cases, the learner chooses $A_t$ from $\mathcal{A}$. In the
stochastic case, the learner receives a reward $X_t = f(A_t) + \eta_t$ where f is an
unknown convex function and $\eta_t$ is noise. In the adversarial setting, the adversary
chooses a sequence of convex functions $f_1, \ldots, f_n$ and the learner receives reward
$X_t = f_t(A_t)$. This turned out to be a major challenge over the last decade with
most approaches leading to suboptimal regret in terms of the horizon. The best
bounds in the stochastic case are by Agarwal et al. [2011], while in the adversarial
case there has been a lot of recent progress [Bubeck et al., 2015a, Bubeck and
Eldan, 2016, Bubeck et al., 2017]. In both cases the dependence of the regret on
the horizon is $\tilde O(\sqrt{n})$, which is optimal in the worst case. Many open questions
remain, such as the optimal dependence on the dimension, or the related problem
of designing practical low-regret algorithms. The interested reader may consult
Shamir [2013] and Hu et al. [2016] for some of the open problems.


Budgeted Bandits 

In many problems, choosing an action costs some resources. In the bandits-with-
knapsacks problem, the learner starts with a fixed budget $B \in [0,\infty)^d$ over
d resource types. As in the standard K-armed stochastic bandit, the learner
chooses $A_t \in [K]$ and receives a reward $X_t$ sampled from a distribution depending
on $A_t$. The twist is that the game does not end after a fixed number of rounds.
Instead, in each round, the environment samples a cost vector $C_t \in [0,1]^d$ from a
distribution that depends on $A_t$. The game ends in the first round $\tau$ for which
there exists an $i \in [d]$ such that $\sum_{t=1}^\tau C_{ti} > B_i$. This line of work was started by
Badanidiyuru et al. [2013] and has been extended in many directions by Agrawal
and Devanur [2014], Tran-Thanh et al. [2012], Ashwinkumar et al. [2014], Xia
et al. [2015], Agrawal and Devanur [2016], Tran-Thanh et al. [2010] and Hanawal
et al. [2015]. A somewhat related idea is the conservative bandit problem where
the goal is to minimise regret subject to the constraint that the learner must not
be much worse than some known baseline. The constraint limits the amount of
exploration and makes the regret guarantees slightly worse [Sui et al., 2015, Wu
et al., 2016, Kazerouni et al., 2017].
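
The stopping rule of the bandits-with-knapsacks game described above can be made concrete with a few lines of Python. This is only a sketch of the termination condition, not of any algorithm from the cited papers; the function name `stopping_round` and the example costs are hypothetical.

```python
def stopping_round(costs, budget):
    """Return the first round tau (1-indexed) in which the cumulative cost of
    some resource strictly exceeds its budget, or None if that never happens.

    costs is a sequence of cost vectors C_t in [0, 1]^d and budget is B.
    """
    spent = [0.0] * len(budget)
    for tau, cost in enumerate(costs, start=1):
        spent = [s + c for s, c in zip(spent, cost)]
        if any(s > b for s, b in zip(spent, budget)):
            return tau
    return None


if __name__ == "__main__":
    costs = [[0.5, 0.1], [0.5, 0.1], [0.5, 0.1]]
    print(stopping_round(costs, [1.0, 5.0]))  # first resource runs out in round 3
```

In the example the first resource reaches its budget exactly in round 2, which does not end the game because the inequality is strict, and exceeds it in round 3.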


Learning with Delays 

In many practical applications, the feedback to the learner is not immediate. The 
time between clicking on a link and buying a product could be minutes, days, 
weeks or longer. Similarly, the response to a drug does not come immediately. 
In most cases, the learner does not have the choice to wait before making the 
next decision. Buyers and patients just keep coming. Perhaps the first paper 
for online learning with delays is by Weinberger and Ordentlich [2002], who 
consider the full information setting. Recently this has become a hot topic, and 
there has been a lot of follow-up work extending the results in various directions 
[Joulani et al., 2013, Desautels et al., 2014, Cesa-Bianchi et al., 2016, Vernade 
et al., 2017, 2018, Pike-Burke et al., 2018, and others]. Learning with delays is 
an interesting example where the adversarial and stochastic models lead to quite 
different outcomes. In general the increase in regret due to rewards being delayed
by at most $\tau$ rounds is a multiplicative $\sqrt{\tau}$ factor for adversarial models and an
additive term only for stochastic models.


Graph Feedback 

There is growing interest in feedback models that lie between the full information 
and bandit settings. One way to do this is to let G be a directed graph with 
K vertices. The adversary chooses a sequence of loss vectors in $[0,1]^K$ as usual.
In each round, the learner chooses a vertex and observes the loss corresponding 
to that vertex and its neighbours. The full information and bandit settings 
are recovered by choosing the graph to be fully connected or have no edges 
respectively, but of course there are many interesting regimes in between. There 
are many variants on this basic problem. For example, G might change in each 
round or be undirected. Or perhaps the graph is changing, and the learner only 
observes it after choosing an action. The reader can explore this topic by reading 
the articles by Mannor and Shamir [2011], Alon et al. [2013], Kocák et al. [2014] 
and Alon et al. [2015] or the short book by Valko [2016]. 
Combinatorial Bandits


A combinatorial bandit is a linear bandit with an action set that is a subset
of the d-dimensional binary hypercube: $\mathcal{A} \subset \{0,1\}^d$. Elements of $\mathcal{A}$ are thus
d-dimensional, binary-valued vectors. Each component may be on or off, but some
combinations are not allowed, hence the combinatorial structure. Combinatorial
bandit problems arise in many applications, some of which are detailed shortly.

The setting is studied in both the adversarial and stochastic models. We focus
on the former in this chapter and discuss the latter in the notes. In the adversarial
setting, as usual, the environment chooses a sequence of loss vectors $y_1, \ldots, y_n$
with $y_t \in \mathbb{R}^d$, and the regret of the learner is

$$R_n = \max_{a \in \mathcal{A}} \mathbb{E}\left[\sum_{t=1}^n \langle A_t - a, y_t\rangle\right],$$

where as usual $A_t$ is the action chosen by the learner in round t.

Unsurprisingly, the algorithms and analysis from Chapters 27 and 28 are
applicable in this setting. The main challenge is controlling the computational
complexity of the resulting algorithms. As we will soon argue, except in special
cases, it is natural to be hopeful when there exists an efficient optimisation
oracle that computes the map $y \mapsto \operatorname{argmin}_{a \in \mathcal{A}} \langle a, y\rangle$. The most important result
of this chapter gives a strategy based on follow-the-perturbed-leader that
makes a single call to such an optimisation oracle in every round for a suitably
chosen vector $L_t \in \mathbb{R}^d$ (a perturbed estimate of the cumulative loss vector). This
is done in the semi-bandit setting, an in-between setting where the learner
receives semi-bandit feedback, which is the vector $(A_{t1} y_{t1}, \ldots, A_{td} y_{td})$. Since
$A_{ti} \in \{0,1\}$, this is equivalent to observing $y_{ti}$ for all i for which $A_{ti} = 1$.

The rest of this chapter is organised as follows: the next section describes 
some additional useful notation. There follows a section that describes selected 
applications. In Section 30.3 we describe an application of Exp3 to the case 
when the learner receives only bandit feedback and explain the computational 
challenges that arise due to the combinatorial nature of the problem. Section 30.4 
explains how online stochastic mirror descent can be applied to the semi-bandit 
setting, which still fails to give an efficient algorithm. Finally, the follow-the- 
perturbed-leader algorithm is introduced and analysed in Section 30.5. 
Notation and Assumptions


In the applications, a key quantity associated with combinatorial action sets is
the largest number of elements m that can be simultaneously 'on' in any given
action:

$$\mathcal{A} \subseteq \{a \in \{0,1\}^d : \|a\|_1 \le m\}.$$

In Chapters 27 and 28, we assumed that $y_t \in \{y : \sup_{a \in \mathcal{A}} |\langle a, y\rangle| \le 1\}$. This
restriction is not consistent with the applications we have in mind, so instead we
assume that $y_t \in [0,1]^d$, which by the definition of $\mathcal{A}$ ensures that $|\langle A_t, y_t\rangle| \le m$
for all t. In the standard bandit model, the learner observes $\langle A_t, y_t\rangle$ in each round.


Applications 


Shortest-Path Problems 

Let G = (V,E) be a fixed graph with a finite set of vertices V and edges 
$E \subseteq V \times V$, with $|E| = d$. The online shortest-path problem is a game over n
rounds between an adversary and a learner. Given fixed vertices $u, v \in V$, the
learner's objective in each round is to find the shortest path between u and v. At
the beginning of the game, the adversary chooses a sequence of vectors $y_1, \ldots, y_n$,
with $y_t \in [0,1]^d$ and $y_{ti}$ representing the length of the ith edge in E in round t.
In each round, the learner chooses a path between u and v. The regret of the
learner is the difference between the distance they travelled and the distance of
the optimal path in hindsight. A path is represented by a vector $a \in \{0,1\}^d$ where
$a_i = 1$ if the ith edge is part of the path. Let $\mathcal{A}$ be the set of paths connecting
vertices u and v. Then the length of path a in round t is $\langle a, y_t \rangle$. In this problem,
m is the length of the longest path. Fig. 30.1 illustrates a typical example. 
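
The encoding of paths as binary vectors can be made concrete with a small sketch. The four-edge graph, the vertex names and the edge lengths below are made up for illustration and are not taken from Fig. 30.1; the two helper functions simply contrast what is revealed under bandit and semi-bandit feedback.

```python
# Hypothetical graph with d = 4 edges; the order of this list fixes the
# coordinates of the action vector a and of the loss vector y.
edges = [("u", "w"), ("w", "v"), ("u", "x"), ("x", "v")]

def path_to_vector(path_edges):
    """Encode a path as a in {0,1}^d with a_i = 1 iff edge i is on the path."""
    return [1 if e in path_edges else 0 for e in edges]

def bandit_feedback(a, y):
    """Bandit feedback: only the total length <a, y> of the chosen path."""
    return sum(ai * yi for ai, yi in zip(a, y))

def semi_bandit_feedback(a, y):
    """Semi-bandit feedback: the vector (a_i * y_i)_i of traversed edge lengths."""
    return [ai * yi for ai, yi in zip(a, y)]

a = path_to_vector({("u", "w"), ("w", "v")})   # the path u -> w -> v
y = [0.25, 0.5, 0.75, 1.0]                     # made-up edge lengths in [0, 1]
total = bandit_feedback(a, y)
per_edge = semi_bandit_feedback(a, y)
```

Here m = 2, since every path from u to v in this toy graph uses two edges.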


Ranking 

Suppose a company has d ads and m locations in which to display them. In each 
round t, the learner should choose the m ads to display, which is represented by a 
vector $A_t \in \{0,1\}^d$ with $\|A_t\|_1 = m$. As before, the adversary chooses $y_t \in [0,1]^d$
that measures the quality of each placement and the learner suffers loss $\langle A_t, y_t \rangle$.
This problem could also be called 'selection' because the order of the items plays
no role. Problems where the order plays a direct role are analysed in Chapter 32.


Figure 30.1 Shortest-path problem between Budapest and Sydney. The learner chooses
the path Budapest-Frankfurt-Singapore-Sydney. In the bandit setting, they observe
total travel time (21 hours), while in the semi-bandit they observe the length of each
flight on the route they took (1 hour, 12 hours, 8 hours).


Multitask Bandits

Consider playing m multi-armed bandits simultaneously, each with k arms. If
the losses for each bandit problem are observed, then it is easy to apply Exp3 or
Exp3-IX to each bandit independently. But now suppose the learner only observes
the sum of the losses. This problem is represented as a combinatorial bandit by
letting d = mk and

$$\mathcal{A} = \Big\{ a \in \{0,1\}^d : \sum_{i=1}^{k} a_{i+kj} = 1 \text{ for all } 0 \le j < m \Big\}.$$

In words, the d coordinates are partitioned into m parts and the learner needs to
select exactly one coordinate (“primitive action”) from each part. The resulting
problem is called the multi-task bandit problem. It is like making
m independent choices in parallel in m bandit problems blindly and then receiving
aggregated feedback for all m choices made. This scenario can arise in
practice when a company is making multiple independent interventions, but the
quality of the interventions is only observed via a single change in revenue.
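
As a small illustration of the multi-task action set, the following sketch enumerates it for tiny m and k. The function name and the loss values are hypothetical, and the code uses 0-based coordinates, so block j occupies coordinates jk, ..., jk + k - 1 rather than the 1-based indexing of the display above.

```python
from itertools import product

def multitask_actions(m, k):
    """Enumerate the multi-task action set for m bandits with k arms each.

    Each action lives in {0,1}^d with d = m * k and has exactly one
    coordinate set to 1 inside each of the m blocks of length k.
    """
    actions = []
    for arms in product(range(k), repeat=m):
        a = [0] * (m * k)
        for j, arm in enumerate(arms):
            a[j * k + arm] = 1
        actions.append(tuple(a))
    return actions

actions = multitask_actions(m=2, k=3)
y = [0.1, 0.9, 0.5, 0.3, 0.2, 0.8]   # made-up losses, one per coordinate
a = actions[0]                        # arm 0 in both bandits
observed = sum(ai * yi for ai, yi in zip(a, y))   # only this sum is revealed
```

The set has k^m elements, which is why enumerating it is only feasible for toy sizes.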


Bandit Feedback 


The easiest approach is to apply the version of Exp3 for linear bandits described 
in Chapter 27. The only difference is that now $|\langle A_t, y_t \rangle|$ can be as large as m,
which increases the regret by a factor of m. We leave the proof of the following 
theorem to the reader (Exercise 30.1). 


THEOREM 30.1. Consider the setting of Section 30.1. If Algorithm 15 is run on
action set $\mathcal{A}$ with appropriately chosen learning rate, then

$$R_n \le 2m\sqrt{3dn \log|\mathcal{A}|} \le m^{3/2}\sqrt{12\,dn \log\left(\frac{ed}{m}\right)}.$$


There are two issues with this approach, both computational. First, the action 
set is typically so large that finding the core set of the central minimum volume
enclosing ellipsoid that determines the Kiefer-Wolfowitz exploration distribution
of Algorithm 15 is hopeless. Second, efficiently sampling from the resulting 
exponential weights distribution may not be possible. There is no silver bullet 
for these issues. The combinatorial bandit can model a repeated version of the 
travelling salesman problem, which is hard even to approximate. Since an online 
learning algorithm with $O(n^p)$ regret with $p < 1$ can be used to approximate
the optimal solution, it follows that no such algorithm can be computationally 
efficient. There are, however, special cases where efficient algorithms exist, and 
we give some pointers to the relevant literature on this at the end of the chapter. 
One modification that greatly eases computation is to replace the optimal Kiefer- 
Wolfowitz exploration distribution with a distribution that can be computed and 
sampled from in an efficient manner, as noted after Theorem 27.1. 


Semi-bandit Feedback and Mirror Descent 


In the semi-bandit setting, the learner observes the loss associated with all non-zero
coordinates of the chosen action. The additional information is exploited by
noting that $y_t$ can now be estimated in each coordinate. Let

$$\hat Y_{ti} = \frac{A_{ti}\, y_{ti}}{\bar A_{ti}}, \tag{30.1}$$

where $\bar A_{ti} = \mathbb{E}[A_{ti} \mid \mathcal{F}_{t-1}]$ with $\mathcal{F}_t = \sigma(A_1, \ldots, A_t)$. An easy calculation shows
that $\mathbb{E}[\hat Y_t \mid \mathcal{F}_{t-1}] = y_t$, so this estimate is still unbiased. Unsurprisingly we will
again use online stochastic mirror descent, which is summarised for this setting
in Algorithm 18.
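
The unbiasedness claim can be checked exactly on a toy example. The distribution P and the loss vector below are made up, and the sketch assumes every coordinate has positive probability of being selected, so that the division in Eq. (30.1) is well defined.

```python
def mean_action(P):
    """A_bar = sum_a P(a) * a for a distribution P given as {action: prob}."""
    d = len(next(iter(P)))
    return [sum(p * a[i] for a, p in P.items()) for i in range(d)]

def estimate(a, y, a_bar):
    """Importance-weighted estimate Y_hat_i = a_i * y_i / A_bar_i."""
    return [ai * yi / bi for ai, yi, bi in zip(a, y, a_bar)]

# Hypothetical distribution over three actions in {0,1}^3.
P = {(1, 1, 0): 0.5, (0, 1, 1): 0.25, (1, 0, 1): 0.25}
y = [0.2, 0.7, 0.4]
a_bar = mean_action(P)

# Exact expectation of the estimate: sum over actions of P(a) * Y_hat(a).
expected = [0.0, 0.0, 0.0]
for a, p in P.items():
    y_hat = estimate(a, y, a_bar)
    expected = [e + p * v for e, v in zip(expected, y_hat)]
```

Up to floating-point rounding, `expected` coincides with `y`, even though each individual estimate only uses the observed coordinates.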


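Input $\mathcal{A}$, $\eta$, $F$
$\bar A_1 = \operatorname{argmin}_{a \in \operatorname{co}(\mathcal{A})} F(a)$
for $t = 1, \ldots, n$ do
    Choose distribution $P_t$ on $\mathcal{A}$ such that $\sum_{a \in \mathcal{A}} P_t(a)\, a = \bar A_t$
    Sample $A_t \sim P_t$ and observe $A_{t1} y_{t1}, \ldots, A_{td} y_{td}$
    Compute $\hat Y_{ti} = A_{ti} y_{ti} / \bar A_{ti}$ for all $i \in [d]$
    Update $\bar A_{t+1} = \operatorname{argmin}_{a \in \operatorname{co}(\mathcal{A})} \eta \langle a, \hat Y_t \rangle + D_F(a, \bar A_t)$
end for

Algorithm 18: Online stochastic mirror descent for semi-bandits.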


THEOREM 30.2. Consider the setting of Section 30.1. Let $F : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ be the
unnormalised negentropy potential:

$$F(a) = \sum_{i=1}^d \left( a_i \log(a_i) - a_i \right)$$

for $a \in [0,\infty)^d$ and $F(a) = \infty$ otherwise. Then, Algorithm 18 is well defined and,
provided $\eta = \sqrt{2m(1 + \log(d/m))/(nd)}$, its regret $R_n$ satisfies

$$R_n \le \sqrt{2nmd(1 + \log(d/m))}.$$


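Proof Since $\mathcal{A}$ is a finite set, the algorithm is well defined. In particular, $\bar A_t$
exists and is unique for all $t \in [n]$. By Theorem 28.10,

$$R_n \le \frac{\operatorname{diam}_F(\operatorname{co}(\mathcal{A}))}{\eta} + \mathbb{E}\left[ \sum_{t=1}^n \langle \bar A_t - \bar A_{t+1}, \hat Y_t \rangle - \frac{1}{\eta} D_F(\bar A_{t+1}, \bar A_t) \right]. \tag{30.2}$$

The diameter is easily bounded by noting that F is negative in $\operatorname{co}(\mathcal{A})$ and using
Jensen's inequality:

$$\operatorname{diam}_F(\operatorname{co}(\mathcal{A})) \le \sup_{a \in \operatorname{co}(\mathcal{A})} \sum_{i=1}^d \left( a_i + a_i \log\left(\frac{1}{a_i}\right) \right) \le m(1 + \log(d/m)).$$

For the second term in Eq. (30.2), let $\hat Y'_{ti} = \hat Y_{ti}\, \mathbb{I}\{\bar A_{t+1,i} \le \bar A_{ti}\}$. Since $\hat Y_t$ is
positive,

$$\langle \bar A_t - \bar A_{t+1}, \hat Y_t \rangle - \frac{1}{\eta} D_F(\bar A_{t+1}, \bar A_t) \le \langle \bar A_t - \bar A_{t+1}, \hat Y'_t \rangle - \frac{1}{\eta} D_F(\bar A_{t+1}, \bar A_t) \le \frac{\eta}{2} \|\hat Y'_t\|^2_{\nabla^2 F(Z_t)^{-1}} \le \frac{\eta}{2} \sum_{i=1}^d \bar A_{ti} \hat Y_{ti}^2,$$

where $Z_t$ is provided by Theorem 26.12 and lies on the chord $[\bar A_t, \bar A_{t+1}]$. The
final inequality follows because $\nabla^2 F(z) = \operatorname{diag}(1/z)$ and using the definition of
$\hat Y'_t$, which ensures that the worst case occurs when $Z_t = \bar A_t$. Summing and taking
the expectation:

$$\mathbb{E}\left[ \sum_{t=1}^n \langle \bar A_t - \bar A_{t+1}, \hat Y_t \rangle - \frac{1}{\eta} D_F(\bar A_{t+1}, \bar A_t) \right] \le \frac{\eta}{2}\, \mathbb{E}\left[ \sum_{t=1}^n \sum_{i=1}^d \frac{A_{ti}}{\bar A_{ti}} \right] \le \frac{\eta n d}{2}.$$

Putting together the pieces shows that

$$R_n \le \frac{m(1 + \log(d/m))}{\eta} + \frac{\eta n d}{2} \le \sqrt{2nmd(1 + \log(d/m))}.$$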

Algorithm 18 plays mirror descent on the convex hull of the actions, which has 
dimension $d - 1$. In principle it would be possible to do the same thing on the
set of distributions over actions, which has dimension $|\mathcal{A}| - 1$. Repeating the
analysis leads to a suboptimal regret of $O(m\sqrt{dn \log(d/m)})$. We encourage
the reader to go through this calculation to see where things go wrong. 


As in Section 30.3, the main problem is computation. In each round the algorithm
needs to find a distribution $P_t$ over $\mathcal{A}$ such that $\sum_{a \in \mathcal{A}} P_t(a)\, a = \bar A_t$. Feasibility
follows from the definition of $\operatorname{co}(\mathcal{A})$, while Carathéodory's theorem proves the
support of $P_t$ never needs to be larger than $d + 1$. Since $\mathcal{A}$ is finite, we can write the
problem of finding $P_t$ in terms of linear constraints, but naively the computational
complexity is polynomial in $k = |\mathcal{A}|$, which is exponential in m. The algorithm also
needs to compute $\bar A_{t+1}$ from $\bar A_t$ and $\hat Y_t$. This is a convex optimisation problem,
but the computational complexity depends on the representation of $\mathcal{A}$ and may be
intractable. See Note 6 for a few more details on this.


Follow-the-Perturbed-Leader 


In this section, we help ourselves to find a computationally efficient algorithm
by adding the assumption that for all $y \in [0,\infty)^d$, the optimisation problem of
finding

$$a^* = \operatorname{argmin}_{a \in \mathcal{A}} \langle a, y \rangle \tag{30.3}$$


admits a computationally efficient algorithm. This assumption feels close to the 
minimum one could get away with in the sense that if the offline problem in 
Eq. (30.3) is hard to approximate, then any algorithm with low regret must also 
be inefficient. A marginally more reasonable assumption is that Eq. (30.3) can be 
approximated efficiently. For simplicity we assume exact solutions, however. 

If the algorithm observed the losses in every round, adding a random vector to 
the sum of previous losses and then finding the action that minimises the total 
randomly perturbed loss leads to what is known as the follow-the-perturbed- 
leader (FTPL) algorithm. As discussed before, the random perturbation is 
necessary to achieve sublinear regret. In the semi-bandit setting, which is considered
here, the full loss vector is unobserved and hence needs to be estimated. Letting
$\hat L_{t-1} = \sum_{s=1}^{t-1} \hat Y_s$ be the cumulative loss estimates before round t, FTPL chooses

$$A_t = \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \eta \hat L_{t-1} - Z_t \rangle, \tag{30.4}$$

where $\eta > 0$ is the learning rate and $Z_t \in \mathbb{R}^d$ is sampled from a carefully chosen
distribution Q. The random perturbations are chosen both to guard against the worst
case and to induce the necessary exploration. Notice that if $\eta$ is small, then the
effect of $Z_t$ is larger and the algorithm can be expected to explore more, which is
consistent with the learning rate used in mirror descent or exponential weighting
studied in previous chapters.

Before defining the loss estimations and perturbation distribution, we make a 
connection between FTPL and mirror descent. Given Legendre potential F with 
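$\operatorname{dom}(\nabla F) = \operatorname{int}(\operatorname{co}(\mathcal{A}))$, online stochastic mirror descent chooses $\bar A_t$ so that

$$\bar A_t = \operatorname{argmin}_{a \in \operatorname{co}(\mathcal{A})} \eta \langle a, \hat Y_{t-1} \rangle + D_F(a, \bar A_{t-1}).$$

Taking derivatives and using the fact that $\operatorname{dom}(\nabla F) = \operatorname{int}(\operatorname{co}(\mathcal{A}))$, we have

$$\nabla F(\bar A_t) = \nabla F(\bar A_{t-1}) - \eta \hat Y_{t-1} = -\eta \hat L_{t-1}.$$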
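By duality (Theorem 26.6), this implies that $\bar A_t = \nabla F^*(-\eta \hat L_{t-1})$. On the other
hand, examining Eq. (30.4), we see that for FTPL,

$$\bar A_t = \mathbb{E}[A_t \mid \mathcal{F}_{t-1}] = \mathbb{E}\left[ \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \eta \hat L_{t-1} - Z_t \rangle \,\middle|\, \mathcal{F}_{t-1} \right],$$

where $\mathcal{F}_t = \sigma(Z_1, \ldots, Z_t)$. Thus, in order to view FTPL as an instance of mirror
descent, it suffices to find a Legendre potential F with $\operatorname{dom}(\nabla F) = \operatorname{int}(\operatorname{co}(\mathcal{A}))$
and

$$\nabla F^*(-\eta \hat L_{t-1}) = \mathbb{E}\left[ \operatorname{argmin}_{a \in \mathcal{A}} \langle a, \eta \hat L_{t-1} - Z_t \rangle \,\middle|\, \mathcal{F}_{t-1} \right] = \mathbb{E}\left[ \operatorname{argmax}_{a \in \mathcal{A}} \langle a, Z_t - \eta \hat L_{t-1} \rangle \,\middle|\, \mathcal{F}_{t-1} \right].$$

Since $\hat L_{t-1}$ is more or less uncontrolled, the latter condition is most easily satisfied
by requiring that for any $x \in \mathbb{R}^d$, $\nabla F^*(x) = \int_{\mathbb{R}^d} \operatorname{argmax}_{a \in \operatorname{co}(\mathcal{A})} \langle a, x + z \rangle \, dQ(z)$.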
To remove clutter in the notation, define 


$$a(x) = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, x \rangle,$$

where $a(x)$ is chosen to be an arbitrary maximiser if multiple maximisers exist.
Readers with some familiarity with convex analysis will remember that if a convex
set $\mathcal{A}$ has a smooth boundary, then the support function of $\mathcal{A}$,

$$\phi(x) = \max_{a \in \mathcal{A}} \langle a, x \rangle,$$

satisfies $\nabla \phi(x) = a(x)$. For combinatorial bandits, $\mathcal{A}$ is not smooth, but if Q is
absolutely continuous with respect to the Lebesgue measure, then you will show
in Exercise 30.5 that

$$\nabla \int_{\mathbb{R}^d} \phi(x + z)\, dQ(z) = \int_{\mathbb{R}^d} a(x + z)\, dQ(z) \quad \text{for all } x \in \mathbb{R}^d.$$

The key to this argument is that the derivative of $\phi$ exists almost everywhere
and is equal to $a(x)$. All this shows is that FTPL can be interpreted as mirror
descent with potential F defined in terms of its Fenchel dual,

$$F^*(x) = \int_{\mathbb{R}^d} \phi(x + z)\, dQ(z). \tag{30.5}$$

Of course we have not shown that F is Legendre or that $\operatorname{int}(\operatorname{dom}(F)) = \operatorname{int}(\operatorname{co}(\mathcal{A}))$,
both of which you will do in Exercise 30.6 under appropriate conditions on Q. 

There are more reasons for making this connection than mere curiosity. The 
classical analysis of FTPL involves at least one ‘leap of faith’ in the analysis. In 
contrast, the analysis via the mirror descent interpretation is more mechanical. 
Recall that mirror descent depends on choosing a potential, an exploration 
distribution and an estimator. We now make the choice of these explicit. The 
exploration distribution is a distribution P, on A such that 


$$\bar A_t = \sum_{a \in \mathcal{A}} P_t(a)\, a,$$

which in our case is implicitly defined by the distribution of $Z_t$:

$$P_t(a) = \mathbb{P}\left( a(Z_t - \eta \hat L_{t-1}) = a \mid \mathcal{F}_{t-1} \right).$$

It remains to choose the loss estimator. A natural choice would be the same as
Eq. (30.1), which is $\hat Y_{ti} = A_{ti} y_{ti} / P_{ti}$ with $P_{ti} = \mathbb{P}(A_{ti} = 1 \mid \mathcal{F}_{t-1}) = \bar A_{ti}$. The
problem is that $P_{ti}$ does not generally have a closed-form solution. And while
$P_{ti}$ can be estimated by sampling, the number of samples required for sufficient
accuracy can be quite large. The next idea is to replace $1/P_{ti}$ in the importance-weighted
estimator with a random variable with conditional expectation equal to
$1/P_{ti}$. This is based on the following well-known result:


LEMMA 30.3. Let $U \in \{1, 2, \ldots\}$ be geometrically distributed with parameter
$\theta \in [0,1]$ so that $\mathbb{P}(U = j) = (1 - \theta)^{j-1} \theta$. Then $\mathbb{E}[U] = 1/\theta$.


You can sample from a geometric distribution with parameter $\theta$ by counting
the number of flips of a biased coin with bias $\theta$ until the first head. That is,
if $(X_t)_{t=1}^\infty$ is an independent sequence of Bernoulli random variables with
bias $\theta$, then $U = \min\{t \ge 1 : X_t = 1\}$ is geometrically distributed with
parameter $\theta$.
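
The coin-flipping construction translates directly into code. The following is a minimal sketch using Python's standard `random` module; the parameter value, sample size and seed are arbitrary choices made for illustration.

```python
import random

def sample_geometric(theta, rng):
    """Count Bernoulli(theta) flips up to and including the first head."""
    flips = 1
    while rng.random() >= theta:   # a tail occurs with probability 1 - theta
        flips += 1
    return flips

rng = random.Random(0)
theta = 0.25
samples = [sample_geometric(theta, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)   # should be close to 1 / theta
```

With this many samples the empirical mean should land near 1/theta = 4, in line with Lemma 30.3.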


Define a sequence of d-dimensional random vectors $K_1, \ldots, K_n$, where $(K_{ti})_{i=1}^d$
is a sequence of geometric random variables that are conditionally independent
given $\mathcal{F}_t$ so that the conditional law of $K_{ti}$ given $\mathcal{F}_t$ is Geometric($P_{ti}$) and where
we now redefine $\mathcal{F}_t = \sigma(Z_1, K_1, \ldots, Z_{t-1}, K_{t-1}, Z_t)$. The estimator of $y_{ti}$ can now
be defined by

$$\hat Y_{ti} = \min(\beta, K_{ti})\, A_{ti}\, y_{ti},$$

where $\beta$ is a positive integer to be chosen subsequently. Note that

$$\mathbb{E}[K_{ti} A_{ti} y_{ti} \mid \mathcal{F}_{t-1}] = y_{ti}.$$

The truncation parameter $\beta$ is needed to ensure that $\hat Y_{ti}$ is never too large. We
have now provided all the pieces to define a version of FTPL that is a special 
case of mirror descent. The algorithm is summarised in Algorithm 19. 
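
Truncating at $\beta$ makes the estimator slightly biased. The sketch below computes E[min(β, K)] for a geometric K term by term and compares it with the closed form $(1 - (1-p)^\beta)/p$ that underlies the bias calculation in the proof of Theorem 30.4; the values of p and β are arbitrary illustrative choices.

```python
def truncated_mean(p, beta):
    """E[min(beta, K)] for K ~ Geometric(p), summed term by term."""
    total = 0.0
    for j in range(1, beta):
        total += j * (1 - p) ** (j - 1) * p    # j * P(K = j) for j < beta
    total += beta * (1 - p) ** (beta - 1)      # beta * P(K >= beta)
    return total

def closed_form(p, beta):
    """The closed form (1 - (1 - p)^beta) / p."""
    return (1 - (1 - p) ** beta) / p

p, beta = 0.2, 5
# Relative bias of the truncated estimator: 1 - p * E[min(beta, K)].
bias_factor = 1 - p * truncated_mean(p, beta)  # equals (1 - p)^beta
```

The relative bias $(1-p)^\beta$ decays geometrically in β, which is why a moderate truncation level already suffices.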


THEOREM 30.4. Consider the setting of Section 30.1. Let Q have density with
respect to the Lebesgue measure of $q(z) = 2^{-d} \exp(-\|z\|_1)$, and choose the
parameters $\eta$, $\beta$ as follows:

$$\eta = \sqrt{\frac{2(1 + \log(d))}{(1 + e^2)dnm}}, \qquad \beta = \left\lceil \frac{1}{\eta m} \right\rceil.$$

Then Algorithm 19 is well defined and, provided that $\eta m \le 1$, its
regret is bounded by $R_n \le m\sqrt{2(1 + e^2)nd(1 + \log(d))}$.
Input $\mathcal{A}$, $n$, $\eta$, $\beta$, $Q$
$\hat L_0 = 0 \in \mathbb{R}^d$
for $t = 1, \ldots, n$ do
    Sample $Z_t \sim Q$
    Compute $A_t = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, Z_t - \eta \hat L_{t-1} \rangle$
    Observe $A_{t1} y_{t1}, \ldots, A_{td} y_{td}$
    For each $i \in [d]$ sample $K_{ti} \sim$ Geometric($P_{ti}$)
    For each $i \in [d]$ compute $\hat Y_{ti} = \min(\beta, K_{ti})\, A_{ti}\, y_{ti}$
    $\hat L_t = \hat L_{t-1} + \hat Y_t$
end for

Algorithm 19: Follow-the-perturbed-leader for semi-bandits.
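
A minimal sketch of Algorithm 19 may help fix ideas. It assumes the m-set action class, for which the linear optimisation oracle reduces to picking the m largest coordinates, and it realises min(β, K_ti) by redrawing the perturbation until coordinate i is selected again or the count reaches β (the truncated sampling discussed in Note 2). The function names, the loss sequence and the parameter values are illustrative choices, not part of the original algorithm statement.

```python
import math
import random

def oracle(x, m):
    """a(x) for m-sets: indicator of the m largest coordinates of x."""
    top = sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:m]
    return [1 if i in top else 0 for i in range(len(x))]

def laplace(d, rng):
    """One draw from Q with density 2^{-d} exp(-||z||_1)."""
    return [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(d)]

def ftpl_semi_bandit(losses, m, eta, beta, seed=0):
    rng = random.Random(seed)
    d = len(losses[0])
    L_hat = [0.0] * d                      # cumulative loss estimates

    def play():
        """Draw a fresh perturbation and call the oracle once."""
        z = laplace(d, rng)
        return oracle([z[i] - eta * L_hat[i] for i in range(d)], m)

    total_loss = 0.0
    for y in losses:
        A = play()
        total_loss += sum(A[i] * y[i] for i in range(d))
        Y_hat = [0.0] * d
        for i in range(d):
            if A[i] == 0:
                continue                   # y[i] is not observed
            # min(beta, K_i): count fresh draws until coordinate i is
            # selected again, stopping once the count reaches beta.
            k = 1
            while k < beta and play()[i] == 0:
                k += 1
            Y_hat[i] = k * y[i]
        for i in range(d):                 # update only after all K_i are drawn
            L_hat[i] += Y_hat[i]
    return total_loss, L_hat

n, m, eta = 50, 2, 0.1
beta = math.ceil(1 / (eta * m))
losses = [[0.0, 0.0, 1.0, 1.0] for _ in range(n)]   # made-up loss sequence
total_loss, L_hat = ftpl_semi_bandit(losses, m, eta, beta)
```

Each coordinate uses its own sequence of fresh draws so that the K_ti are conditionally independent, at the price of up to β - 1 extra oracle calls per selected coordinate.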


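Proof First, note that $A_t$ is almost surely uniquely defined and so is $\bar A_t =
\mathbb{E}[A_t \mid \mathcal{F}_{t-1}]$. Therefore, by isolating the bias in the loss estimators, and thanks
to Exercise 30.6, we can apply Theorem 28.4 to get that

$$\mathbb{E}\left[ \sum_{t=1}^n \langle A_t - a, y_t \rangle \right] = \mathbb{E}\left[ \sum_{t=1}^n \langle \bar A_t - a, \hat Y_t \rangle \right] + \mathbb{E}\left[ \sum_{t=1}^n \langle \bar A_t - a, y_t - \hat Y_t \rangle \right]$$

$$\le \frac{\operatorname{diam}_F(\mathcal{A})}{\eta} + \mathbb{E}\left[ \frac{1}{\eta} \sum_{t=1}^n D_F(\bar A_t, \bar A_{t+1}) \right] + \mathbb{E}\left[ \sum_{t=1}^n \langle \bar A_t - a, y_t - \hat Y_t \rangle \right]. \tag{30.6}$$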
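Of the three terms, the diameter is most easily bounded. For $Z \sim Q$,

$$F(a) = \sup_{x \in \mathbb{R}^d} \left( \langle a, x \rangle - F^*(x) \right) = \sup_{x \in \mathbb{R}^d} \left( \langle a, x \rangle - \mathbb{E}\left[ \max_{b \in \mathcal{A}} \langle b, x + Z \rangle \right] \right) \tag{30.7}$$

$$\ge -\mathbb{E}\left[ \max_{b \in \mathcal{A}} \langle b, Z \rangle \right] \ge -m\, \mathbb{E}[\|Z\|_\infty] = -m \sum_{i=1}^d \frac{1}{i} \ge -m(1 + \log(d)),$$

where the first inequality follows by choosing $x = 0$ and the second follows
from Hölder's inequality and that $\|a\|_1 \le m$ for any $a \in \mathcal{A}$. The last equality is
non-trivial and is explained in Exercise 30.4. By the convexity of the maximum
function and the fact that Z is centered, we also have from Eq. (30.7) that
$F(a) \le 0$, which means that

$$\operatorname{diam}_F(\mathcal{A}) = \max_{a, b \in \mathcal{A}} F(a) - F(b) \le m(1 + \log(d)). \tag{30.8}$$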
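
The identity $\mathbb{E}[\|Z\|_\infty] = \sum_{i=1}^d 1/i$ used above can be sanity-checked numerically. The sketch below relies on the tail formula $\mathbb{E}[M] = \int_0^\infty \mathbb{P}(M > x)\, dx$ and on the fact that the coordinates of $|Z|$ are independent standard exponentials (part (a) of Exercise 30.4); the integration range and step count are ad hoc choices. It is a numerical check only, not a substitute for the proof asked for in the exercise.

```python
import math

def expected_sup_norm(d, upper=50.0, steps=50000):
    """Trapezoidal estimate of the integral of 1 - (1 - e^{-x})^d over [0, upper]."""
    h = upper / steps

    def tail(x):
        # P(max of d standard exponentials > x)
        return 1.0 - (1.0 - math.exp(-x)) ** d

    total = 0.5 * (tail(0.0) + tail(upper))
    total += sum(tail(k * h) for k in range(1, steps))
    return total * h

def harmonic(d):
    """Harmonic number H_d = 1 + 1/2 + ... + 1/d."""
    return sum(1.0 / i for i in range(1, d + 1))

values = {d: (expected_sup_norm(d), harmonic(d)) for d in (1, 3, 10)}
```

For each d the two entries of `values[d]` should agree to several decimal places, and the harmonic number stays below 1 + log(d) as used in the last step of the display.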


The next step is to bound the Bregman divergence induced by F. We will shortly
show that the Hessian $\nabla^2 F^*(x)$ of $F^*$ exists, so by Part (b) of Theorem 26.6
and Taylor's theorem, there exists an $\alpha \in [0,1]$ and $\xi = -\eta \hat L_{t-1} - \alpha \eta \hat Y_t$ such that

$$D_F(\bar A_t, \bar A_{t+1}) = D_{F^*}(\nabla F(\bar A_{t+1}), \nabla F(\bar A_t)) = D_{F^*}(-\eta \hat L_{t-1} - \eta \hat Y_t, -\eta \hat L_{t-1}) = \frac{\eta^2}{2} \|\hat Y_t\|^2_{\nabla^2 F^*(\xi)}, \tag{30.9}$$

where the last equality follows from Taylor's theorem (see Theorem 26.12). To
calculate the Hessian, we use a change of variable to avoid applying the gradient
to the non-differentiable argmax:

$$\nabla^2 F^*(x) = \nabla(\nabla F^*(x)) = \nabla \mathbb{E}[a(x + Z)] = \nabla \int_{\mathbb{R}^d} a(x + z)\, q(z)\, dz$$

$$= \nabla \int_{\mathbb{R}^d} a(u)\, q(u - x)\, du = \int_{\mathbb{R}^d} a(u)\, (\nabla_x q(u - x))^\top du$$

$$= \int_{\mathbb{R}^d} a(u)\, \operatorname{sign}(u - x)^\top q(u - x)\, du = \int_{\mathbb{R}^d} a(x + z)\, \operatorname{sign}(z)^\top q(z)\, dz.$$
Ra Ra 


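Using the definition of $\xi$ and the fact that $a(x)$ is non-negative,

$$\nabla^2 F^*(\xi)_{ij} = \int_{\mathbb{R}^d} a(\xi + z)_i\, \operatorname{sign}(z)_j\, q(z)\, dz \tag{30.10}$$

$$\le \int_{\mathbb{R}^d} a(\xi + z)_i\, q(z)\, dz = \int_{\mathbb{R}^d} a(z - \eta \hat L_{t-1} - \alpha \eta \hat Y_t)_i\, q(z)\, dz$$

$$= \int_{\mathbb{R}^d} a(u - \eta \hat L_{t-1})_i\, q(u + \alpha \eta \hat Y_t)\, du \le \exp\left( \|\alpha \eta \hat Y_t\|_1 \right) \int_{\mathbb{R}^d} a(u - \eta \hat L_{t-1})_i\, q(u)\, du$$

$$\le e^2 P_{ti}, \tag{30.11}$$

where the last inequality follows since $\alpha \in [0,1]$ and $\hat Y_{ti} \le \beta = \lceil 1/(m\eta) \rceil$, $\eta m \le 1$
and $\hat Y_t$ has at most m non-zero entries. Continuing on from Eq. (30.9), we have

$$\frac{\eta^2}{2} \|\hat Y_t\|^2_{\nabla^2 F^*(\xi)} \le \frac{e^2 \eta^2}{2} \sum_{i=1}^d \sum_{j=1}^d P_{ti} \hat Y_{ti} \hat Y_{tj} \le \frac{e^2 \eta^2}{2} \sum_{i=1}^d \sum_{j=1}^d P_{ti} K_{ti} A_{ti} K_{tj} A_{tj}.$$

Chaining together the parts and taking the expectation shows that

$$\mathbb{E}[D_F(\bar A_t, \bar A_{t+1})] \le \frac{e^2 \eta^2}{2}\, \mathbb{E}\left[ \sum_{i=1}^d \sum_{j=1}^d P_{ti} K_{ti} A_{ti} K_{tj} A_{tj} \right] = \frac{e^2 \eta^2}{2}\, \mathbb{E}\left[ \sum_{i=1}^d \sum_{j=1}^d \frac{A_{ti} A_{tj}}{P_{tj}} \right] \le \frac{e^2 m d \eta^2}{2}.$$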


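The last step is to control the bias term. For this, first note that since
$A_{ti} \in \{0,1\}$,

$$\mathbb{E}[\hat Y_{ti} \mid \mathcal{F}_t] = \mathbb{E}[\min(\beta, K_{ti})\, A_{ti} y_{ti} \mid \mathcal{F}_t] = A_{ti} y_{ti}\, \mathbb{E}[\min(\beta, K_{ti}) \mid \mathcal{F}_t] = \frac{A_{ti} y_{ti}}{P_{ti}} \left( 1 - (1 - P_{ti})^\beta \right),$$

where the last equality follows from the definition of $K_{ti}$ using a direct calculation.
Thus, $\mathbb{E}[\hat Y_{ti} \mid \mathcal{F}_{t-1}] = (1 - (1 - P_{ti})^\beta)\, y_{ti}$ and

$$\mathbb{E}\left[ \sum_{t=1}^n \langle \bar A_t - a, y_t - \hat Y_t \rangle \right] \le \mathbb{E}\left[ \sum_{t=1}^n \langle \bar A_t, y_t - \hat Y_t \rangle \right] = \mathbb{E}\left[ \sum_{t=1}^n \sum_{i=1}^d y_{ti} P_{ti} (1 - P_{ti})^\beta \right] \le \frac{dn}{2\beta} \le \frac{dnm\eta}{2},$$

where the last inequality follows from using that for $x \in [0,1]$, $s > 0$,
$x(1 - x)^s \le x e^{-sx} \le 1/s$. Putting together all the pieces into Eq. (30.6) leads to

$$R_n \le \frac{m(1 + \log(d))}{\eta} + \frac{e^2 dnm\eta}{2} + \frac{dnm\eta}{2} \le m\sqrt{2(1 + e^2)nd(1 + \log(d))}.$$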
ce n T 2 T 2 


Notes 

1 For a long time, it was speculated that the dependence of the regret on $m^{3/2}$
in Theorem 30.1 (bandit feedback) might be improvable to m. Very recently,
however, the lower bound was increased to show the upper bound is tight
[Cohen et al., 2017]. For semi-bandits the worst-case lower bound is $\Omega(\sqrt{dnm})$
(Exercise 30.8), which holds for large enough n and $m \le d/2$ and is matched
up to constant factors by online stochastic mirror descent with a different
potential (Exercise 30.7).


2 The implementation of FTPL shown in Algorithm 19 needs to sample $K_{ti}$ for
each i with $A_{ti} = 1$. The conditional expected running time for this is $A_{ti}/P_{ti}$,
which has expectation 1. It follows that the expected running time over the
whole n rounds is $O(nd)$ calls to the oracle linear optimisation algorithm. It
can happen that the algorithm is unlucky and chooses $A_{ti} = 1$ for some i
with $P_{ti}$ quite small and then sampling $K_{ti}$ could be time-consuming. Note,
however, that only $\min(K_{ti}, \beta)$ is actually used by the algorithm, and hence
the sampling procedure can be truncated at $\beta$. This minor modification ensures
the algorithm needs at most $O(\beta nd)$ calls to the oracle in the worst case.


3 While FTPL is excellent in the face of semi-bandit information, we do not know
of a general result for the bandit model. The main challenge is controlling the
variance of the least squares estimator without explicitly inducing exploration
using a sophisticated exploration distribution like what is provided by
Kiefer-Wolfowitz.
4 Combinatorial bandits can also be studied in a stochastic setting. There are
several ways to do this. The first mirrors our assumptions for stochastic linear
bandits in Chapter 19, where the loss (more commonly reward) is defined by

$$X_t = \langle A_t, \theta \rangle + \eta_t, \tag{30.12}$$

where $\theta \in \mathbb{R}^d$ is fixed and unknown and $\eta_t$ is the noise on which statistical
assumptions are made (for example, conditionally 1-subgaussian). There are
at least two alternatives. Suppose that $\theta_1, \ldots, \theta_n$ are sampled independently
from some multivariate distribution, and define the reward by

$$X_t = \langle A_t, \theta_t \rangle. \tag{30.13}$$

This latter version has 'parameter noise' (cf. Chapter 29) and is more closely
related to the adversarial set-up studied in this chapter. Finally, one can assume
additionally that the distribution of $\theta_t$ is a product distribution so that $(\theta_{ti})_{i=1}^d$
are also independent.


5 For some action sets, the off-diagonal elements of the Hessian in Eq. (30.10)
are negative, which improves the dependence on m to $\sqrt{m}$. An example where
this occurs is when $\mathcal{A} = \{a \in \{0,1\}^d : \|a\|_1 = m\}$. Let $i \ne j$, and suppose that
$z, \xi \in \mathbb{R}^d$ and $z_j \ge 0$. Then you can check that $a(z + \xi)_i \le a(z - 2z_j e_j + \xi)_i$,
and so

$$\nabla^2 F^*(\xi)_{ij} = \int_{\mathbb{R}^d} a(z + \xi)_i\, \operatorname{sign}(z)_j\, q(z)\, dz = \int_{\mathbb{R}^{d-1}} \int_0^\infty \left( a(z + \xi)_i - a(z - 2z_j e_j + \xi)_i \right) q(z)\, dz_j\, dz_{-j} \le 0,$$

where $dz_{-j}$ is shorthand for $dz_1\, dz_2 \cdots dz_{j-1}\, dz_{j+1} \cdots dz_d$. You are asked to
complete all the details in Exercise 30.9. This result unfortunately does not
hold for every action set (Exercise 30.10).


6 In order to implement mirror descent or follow-the-regularised-leader with
bandit or semi-bandit information, one needs to solve two optimisation problems:
(a) a convex optimisation problem of the form $\operatorname{argmin}_{a \in \operatorname{co}(\mathcal{A})} F(a)$ for some
convex F and (b) a linear optimisation problem to find a distribution P over $\mathcal{A}$
with mean $\bar a$ where $\bar a \in \operatorname{co}(\mathcal{A})$. More or less sufficient is an efficient membership
oracle for $\operatorname{co}(\mathcal{A})$ and evaluation oracle for F [Grötschel et al., 2012, Lee et al.,
2018]. Also necessary for bandits is to identify an exploration distribution,
which we discuss in the notes and bibliographic remarks of Chapter 27. This is
not required for semi-bandits, however, at least with the negentropy potential
used by Algorithm 18.
Bibliographic Remarks


The online combinatorial bandit was introduced by Cesa-Bianchi and Lugosi 
[2012], where you will also find the most comprehensive list of known applications 
for which efficient algorithms exist. The regret bound for Exp3 given in 
Theorem 30.1 for the bandit case is due to Bubeck and Cesa-Bianchi [2012] 
(with a slightly different argument). While computational issues remain in the 
bandit problem, there has been some progress in certain settings. Combes et al. 
[2015b] propose playing mirror descent on the convex hull of the action set without 
fancy exploration, which leads to near-optimal bounds for well-behaved action 
sets. One could also use continuous exponential weights from Chapter 27. These 
methods lead to computationally efficient algorithms for some action sets, but this 
must be checked on a case-by-case basis. The full information setting has been 
studied quite extensively [Koolen et al., 2010, and references from/to]. FTPL was 
first proposed (in the full information context) by Hannan [1957], rediscovered 
by Kalai and Vempala [2002, 2005] and generalised by Hutter and Poland [2005]. 
Poland [2005] and Kujala and Elomaa [2005] independently applied FTPL to finite- 
armed adversarial bandits and showed near-optimal regret for this case. Poland 
[2005] also proposed to use Monte Carlo simulation to estimate the probability 
of choosing each arm needed in the construction of reward estimates. Kujala and 
Elomaa [2007] extended the result to non-oblivious adversaries. For combinatorial 
settings, suboptimal rates have been shown by Awerbuch and Kleinberg [2004], 
McMahan and Blum [2004] and Dani and Hayes [2006]. Semi-bandits seem to have 
been introduced in the context of shortest-path problems by György et al. [2007]. 
The general set-up and algorithmic analysis of FTPL presented follows the work 
by Neu [2015a], who also introduced the idea to estimate the inverse probabilities 
via a geometric random variable. Our analysis based on mirror descent is novel. 
The analysis follows ideas of Abernethy et al. [2014], who present the core ideas in 
the prediction with expert advice setting, Cohen and Hazan [2015], who consider 
the combinatorial full information case, and Abernethy et al. [2015], who study 
finite-armed bandits. The literature on stochastic combinatorial semi-bandits
is also quite large with algorithms and analysis in the frequentist [Gai et al., 
2012, Combes et al., 2015b, Kveton et al., 2015b] and Bayesian settings [Wen 
et al., 2015, Russo and Van Roy, 2016]. These works focus on the case where the 
reward is given by Eq. (30.13) and the components of $\theta_t$ are independent. When
the reward is given by Eq. (30.12), one can use the tools for stochastic linear 
bandits developed in Part V. Some work also pushes beyond the assumption 
that the rewards are linear [Chen et al., 2013, Lin et al., 2015, Chen et al., 
2016a,b, Wang and Chen, 2018]. The focus in these works is on understanding 
what are the minimal structural assumptions on the reward function and action 
spaces for which learning in combinatorially large action spaces is still feasible 
statistically/computationally. Last of all, we mentioned that the travelling salesman
problem is computationally hard to approximate, which you can read about in the paper
by Papadimitriou and Vempala [2006], and references therein.
Exercises


30.1 (MIRROR DESCENT FOR COMBINATORIAL BANDITS) Prove Theorem 30.1. 


HINT For the second inequality, you may find it useful to know that for
$0 \le m \le n$, defining $\Phi_m(n) = \sum_{i=0}^m \binom{n}{i}$, it holds that $(m/n)^m \Phi_m(n) \le e^m$.

30.2 (EFFICIENT COMPUTATION ON m-SETS) Provide an efficient implementation
of Algorithm 18 for the m-set: $\mathcal{A} = \{a \in \{0,1\}^d : \|a\|_1 = m\}$.


30.3 (EFFICIENT COMPUTATION ON SHORTEST-PATH PROBLEMS) Playing mirror 
descent on co(A) leads to a good bound for bandit or semi-bandit problems, but 
sometimes playing Exp3 over A is more efficient, even when A is exponentially 
large. Design and analyse a variant of Exp3 for the online shortest-path problem 
with semi-bandit feedback described in Section 30.2. Your challenge is to ensure 
the following: 


(a) a regret of $R_n = O(\sqrt{n})$, with dependence on d and m omitted; and
(b) polynomial computation complexity in n and d. 


HINT This is not the easiest exercise. Start by reading the paper by Takimoto 
and Warmuth [2003], then follow up with that of Gyorgy et al. [2007]. 


30.4 (EXPECTED SUPREMUM NORM OF LAPLACE) Let Z be sampled from a
measure on $\mathbb{R}^d$ with density $f(z) = 2^{-d} \exp(-\|z\|_1)$. The purpose of this exercise
is to show that

$$\mathbb{E}[\|Z\|_\infty] = \sum_{i=1}^d \frac{1}{i}. \tag{30.14}$$

(a) Let $X_1, \ldots, X_d$ be independent standard exponentials. Show that $\|Z\|_\infty$ and
$\max\{X_1, \ldots, X_d\}$ have the same law.

(b) Let $M_j = \max_{i \le j} X_i$. Prove for $j \ge 2$ that

$$\mathbb{E}[M_j] = \mathbb{E}[M_{j-1}] + \mathbb{E}[\exp(-M_{j-1})].$$

(c) Prove by induction or otherwise that for all $a, j \in \{1, 2, \ldots\}$,

$$\mathbb{E}[\exp(-a M_j)] = \frac{a!}{\prod_{b=1}^{a} (j + b)}.$$

(d) Prove the claim in Eq. (30.14).


30.5 (GRADIENT OF EXPECTED SUPPORT FUNCTION) Let $\mathcal{A} \subset \mathbb{R}^d$ be a compact
set and $\phi(x) = \max_{a \in \mathcal{A}} \langle a, x \rangle$ its support function. Let Q be a measure on $\mathbb{R}^d$ that
is absolutely continuous with respect to the Lebesgue measure, and let $Z \sim Q$.
Show that

$$\nabla \mathbb{E}[\phi(x + Z)] = \mathbb{E}\left[ \operatorname{argmax}_{a \in \mathcal{A}} \langle a, x + Z \rangle \right].$$

HINT Recall that the support function $\phi$ of a non-empty compact set is a proper
convex function. Then, note that for any proper convex function $f : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$,
the set $\mathbb{R}^d \setminus \operatorname{dom}(\nabla f)$ has Lebesgue measure zero [Rockafellar, 2015, Theorem
25.5]. Next, by Danskin's theorem, the directional derivative of $\phi$ in the direction
$v \in \mathbb{R}^d$ is given by $\nabla_v \phi(x) = \max_{a \in \mathcal{A}(x)} \langle a, v \rangle$, where $\mathcal{A}(x)$ is the set of maximisers
of $a \mapsto \langle a, x \rangle$ over $\mathcal{A}$ [Bertsekas, 2015, Proposition 5.4.8 in Appendix B]. Finally,
it is worth remembering the following result: let f be an extended real-valued
function with $x \in \mathbb{R}^d$ in the interior of its domain. Then, for some $g \in \mathbb{R}^d$,
$\nabla_v f(x) = \langle g, v \rangle$ holds true for all $v \in \mathbb{R}^d$ if and only if $\nabla f(x)$ exists and is equal
to g.


30.6 A function $f : \mathbb{R}^d \to \mathbb{R}$ is closed if its epigraph is a closed set. Let $F^*$ be
the function defined in Section 30.5 and F be the proper convex closed function
whose Fenchel dual is $F^*$.


(a) Show that the function F is well defined (F* is the Fenchel dual of a proper 
convex closed function, and there is only a single such function). 

(b) For the remainder of the exercise, let Q be absolutely continuous with respect 
to the Lebesgue measure with an everywhere positive density, and let A be 
the convex hull of finitely many points in R? whose span is R?. Show that 
the function F is Legendre. 

(c) Show that $\operatorname{int}(\operatorname{dom}(F)) = \operatorname{int}(\operatorname{co}(\mathcal{A}))$.


Hint For Part (a), it may be worth recalling that the bidual (the dual of the 
dual) of a proper convex closed function f is itself: f = f**(= (f*)*). Furthermore, 
the Fenchel dual of a proper function is always a proper convex closed function. 


30.7 (MINIMAX BOUND FOR COMBINATORIAL SEMI-BANDITS) Adapt the analysis
in Exercise 28.15 to derive an algorithm for combinatorial bandits with semi-bandit
feedback for which the regret is $R_n \le C\sqrt{mdn}$ for universal constant
$C > 0$.


30.8 (LOWER BOUND FOR COMBINATORIAL SEMI-BANDITS) Let $m \ge 1$ and
$d = km$ for some $k > 1$. Prove that for any algorithm there exists a combinatorial
semi-bandit such that $R_n \ge c \min\{nm, \sqrt{mdn}\}$ where $c > 0$ is a universal
constant.


HINT The most obvious choice is $\mathcal{A} = \{a \in \{0,1\}^d : \|a\|_1 = m\}$, which are
sometimes called m-sets. A lower bound does hold for this action set [Lattimore
et al., 2018]. However, an easier path is to impose a little additional structure
such as multi-task bandits.


30.9 (FOLLOW-THE-PERTURBED-LEADER FOR m-SETS) Use the ideas in Note 5 to
prove that FTPL has $R_n = O(\sqrt{mnd})$ regret when $\mathcal{A} = \{a \in \{0,1\}^d : \|a\|_1 = m\}$.


Hint After proving the off-diagonal elements of the Hessian are negative, you 
will also need to tune the learning rate. We do not know of a source for this 
result, but the full information case was studied by Cohen and Hazan [2015]. 
30.10 Construct an action set and $i \ne j$ and $z \in \mathbb{R}^d$ with $z_j > 0$ such that
$a(z)_i > a(z - 2z_j e_j)_i$.


HINT Consider the shortest-path problem defined by the graph below.

[Figure: a graph from 'start' to 'goal' with two of its edges labelled i and j.]

Choose losses for the edges z, and think about what happens when the loss
associated with edge j decreases.
31.1


Non-stationary Bandits 


The competitor class used in the standard definition of the 
regret is not appropriate when the underlying environment 
is changing. In this chapter we increase the power of 
the competitor class to ‘track’ changing environments 
and derive algorithms for which the regret relative to 
this enlarged class is not too large. While the results 
are specified to bandits with finitely many arms (both 
stochastic and adversarial), many of the ideas generalise to 
other models such as linear bandits. This chapter also 
illustrates the flexibility of the tools presented in the 
earlier chapters, which are applied here almost without 
modification. We hope (and expect) that this will also be 
true for other models you might study. 


Figure 31.1 This bandit is definitely not stationary!


Adversarial Bandits 


In contrast to stochastic bandits, the adversarial bandit model presented in 
Chapter 11 does not prevent the environment from changing over time. The 
problem is that bounds on the regret can become vacuous when the losses appear 
non-stationary. To illustrate an extreme situation, suppose you face a two-armed
adversarial bandit with losses $y_{t1} = \mathbb{I}\{t \le n/2\}$ and $y_{t2} = \mathbb{I}\{t > n/2\}$. If we run
Exp3 on this problem, then Theorem 11.2 guarantees that

$$R_n = \mathbb{E}\left[ \sum_{t=1}^n y_{tA_t} \right] - \min_{i \in \{1,2\}} \sum_{t=1}^n y_{ti} \le \sqrt{2nk\log(k)}.$$

Since $\min_{i \in \{1,2\}} \sum_{t=1}^n y_{ti} = n/2$, by rearranging we see that

$$\mathbb{E}\left[ \sum_{t=1}^n y_{tA_t} \right] \le \frac{n}{2} + \sqrt{2nk\log(k)}.$$

To put this in perspective, a policy that plays each arm with probability half in
every round would have $\mathbb{E}[\sum_{t=1}^n y_{tA_t}] = n/2$. In other words, the regret guarantee
is practically meaningless.

What should we expect for this problem? The sequence of losses is so regular 
that we might hope that a clever policy will mostly play the second arm in the
first n/2 rounds and then switch to playing mostly the first arm in the second
n/2 rounds. Then the cumulative loss would be close to zero and the regret would
be negative. Rather than aiming to guarantee negative regret, we redefine the
regret by enlarging the competitor class as a way to ensure meaningful results.
Let $\Gamma_{nm} \subset [k]^n$ be the set of action sequences of length n with at most $m - 1$
changes:

$$\Gamma_{nm} = \left\{ (a_t)_{t=1}^n \in [k]^n : \sum_{t=1}^{n-1} \mathbb{I}\{a_t \ne a_{t+1}\} \le m - 1 \right\}.$$

Then define the non-stationary regret with $m - 1$ change points by

$$R_{nm} = \mathbb{E}\left[ \sum_{t=1}^n y_{tA_t} \right] - \min_{a \in \Gamma_{nm}} \sum_{t=1}^n y_{ta_t}.$$

The non-stationary regret is sometimes called the tracking regret because a 


learner that makes it small must ‘track’ the best arm as it changes. Notice 
that Rnı coincides with the usual definition of the regret. Furthermore, on the 
sequence described at the beginning of the section, we see that 


Rnz = S va ry 
t=1 


which means a policy can only enjoy sublinear non-stationary regret if it detects 
the change point quickly. The obvious question is whether or not such a policy 
exists and how its regret depends on m. 


Exp4 for Non-stationary Bandits 

One idea is to use the Exp4 policy from Chapter 18 with a large set of experts, 
one for each $a \in \Gamma_{nm}$. Theorem 18.1 shows that Exp4 with these experts suffers 
regret of at most 

$$R_{nm} \le \sqrt{2nk\log|\Gamma_{nm}|}\,. \qquad (31.1)$$

Naively bounding $\log|\Gamma_{nm}|$ (Exercise 31.1) and ignoring constant factors shows 
that 

$$R_{nm} = O\left(\sqrt{nmk\log\left(\frac{kn}{m}\right)}\right). \qquad (31.2)$$

To see that you cannot do much better than this, imagine interacting with m 
adversarial bandit environments sequentially, each with horizon n/m. No matter 
what policy you propose, there exist choices of bandits such that the expected 
regret suffered against each bandit is at least $\Omega(\sqrt{nk/m})$. After summing over 
the m instances, we see that the worst-case regret is at least 

$$R_{nm} = \Omega\left(\sqrt{nmk}\right),$$

which matches the upper bound except for logarithmic factors. Notice how this 
lower bound applies to policies that know the location of the changes, so it is 
not true that things are significantly harder in the absence of this knowledge. 
There is one big caveat with all these calculations. The running time of a naive 
implementation of Exp4 is linear in the number of experts, which even for modestly 
sized m is very large indeed. 
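
To get a feel for how large the expert set is, the sketch below counts the sequences with at most m − 1 changes. It relies on a counting argument that is not spelled out in the text and should be read as an assumption of this example: a sequence with exactly j changes is determined by the j change locations among the n − 1 gaps, the first action, and one of k − 1 new actions at each change. The function names are hypothetical.

```python
import math

def log_num_experts(n, k, m):
    """Return log of the number of sequences in [k]^n with at most m - 1 changes."""
    total = sum(math.comb(n - 1, j) * k * (k - 1) ** j for j in range(m))
    return math.log(total)

def exp4_bound(n, k, m):
    """Right-hand side of Eq. (31.1) evaluated with the exact expert count."""
    return math.sqrt(2 * n * k * log_num_experts(n, k, m))

n, k = 1000, 2
for m in (2, 5, 10):
    experts = math.exp(log_num_experts(n, k, m))
    print(m, f"{experts:.3g} experts", f"bound {exp4_bound(n, k, m):.1f}")
```

The regret bound grows slowly with m because it depends only on the logarithm of the count, while the count itself (and hence the running time of a naive implementation) grows very quickly.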


Online Stochastic Mirror Descent 
The computational issues faced by Exp4 are most easily overcome using the 
tools from online convex optimisation developed in Chapter 28. The idea is to 
use online stochastic mirror descent and the unnormalised negentropy potential. 
Without further modification, this would be Exp3, which you will show does not 
work for non-stationary bandits (Exercise 31.3). The trick is to restrict the action 
set to the clipped simplex $\mathcal{A} = \mathcal{P}_{k-1} \cap [\alpha, 1]^k$, where $\alpha \in [0, 1/k]$ is a constant to 
be tuned subsequently. The clipping ensures the algorithm does not commit too 
hard to any single arm. The rationale is that a strong commitment could prevent 
the discovery of change points. 

Let $F : [0,\infty)^k \to \mathbb{R}$ be the unnormalised negentropy potential and $P_1 \in \mathcal{A}$ be 
the uniform probability vector. In each round t, the learner samples $A_t \sim P_t$ and 
updates its sampling distribution using 

$$P_{t+1} = \operatorname{argmin}_{p \in \mathcal{A}} \; \eta\langle p, \hat Y_t\rangle + D_F(p, P_t)\,, \qquad (31.3)$$

where $\eta > 0$ is the learning rate and $\hat Y_{ti} = \mathbb{I}\{A_t = i\}\, y_{ti} / P_{ti}$ is the 
importance-weighted estimator of the loss of action i for round t. The solution to the 
optimisation problem of Eq. (31.3) can be computed efficiently using the two-step 
process: 

$$\tilde P_{t+1} = \operatorname{argmin}_{p \in [0,\infty)^k} \; \eta\langle p, \hat Y_t\rangle + D_F(p, P_t)\,,$$

$$P_{t+1} = \operatorname{argmin}_{p \in \mathcal{A}} \; D_F(p, \tilde P_{t+1})\,.$$

The first of these sub-problems can be evaluated analytically, yielding $\tilde P_{t+1,i} = 
P_{ti} \exp(-\eta \hat Y_{ti})$. The second can be solved efficiently using the result in 
Exercise 26.12. The algorithm enjoys the following guarantee on its regret: 


THEOREM 31.1. The expected regret of the policy sampling $A_t \sim P_t$ with $P_t$ 
defined in Eq. (31.3) is bounded by 

$$R_{nm} \le \alpha n(k-1) + \frac{m\log(1/\alpha)}{\eta} + \frac{\eta n k}{2}\,.$$


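Proof Let $a^* \in \operatorname{argmin}_{a \in \Gamma_{nm}} \sum_{t=1}^n y_{ta_t}$ be an optimal sequence of actions in 
hindsight constrained to $\Gamma_{nm}$. Then let $1 = t_1 < t_2 < \cdots < t_m < t_{m+1} = n+1$ 
so that $a^*_t$ is constant on each interval $\{t_i, \ldots, t_{i+1} - 1\}$. We abuse notation by 
writing $a^*_i = a^*_{t_i}$. Then the regret decomposes into 

$$R_{nm} = \mathbb{E}\left[\sum_{t=1}^n \left(y_{tA_t} - y_{ta^*_t}\right)\right] = \mathbb{E}\left[\sum_{i=1}^m \sum_{t=t_i}^{t_{i+1}-1} \left(y_{tA_t} - y_{ta^*_i}\right)\right] = \sum_{i=1}^m \mathbb{E}\left[\mathbb{E}\left[\sum_{t=t_i}^{t_{i+1}-1} \left(y_{tA_t} - y_{ta^*_i}\right) \,\Big|\, P_{t_i}\right]\right].$$

The next step is to apply Eq. (28.11) and the solution to Exercise 28.10 to bound 
the inner expectation, giving 

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=t_i}^{t_{i+1}-1} \left(y_{tA_t} - y_{ta^*_i}\right) \,\Big|\, P_{t_i}\right]
&= \mathbb{E}\left[\sum_{t=t_i}^{t_{i+1}-1} \langle P_t - e_{a^*_i}, y_t\rangle \,\Big|\, P_{t_i}\right] \\
&\le \alpha(t_{i+1} - t_i)(k-1) + \mathbb{E}\left[\max_{p \in \mathcal{A}} \sum_{t=t_i}^{t_{i+1}-1} \langle P_t - p, y_t\rangle \,\Big|\, P_{t_i}\right] \\
&= \alpha(t_{i+1} - t_i)(k-1) + \mathbb{E}\left[\max_{p \in \mathcal{A}} \sum_{t=t_i}^{t_{i+1}-1} \langle P_t - p, \hat Y_t\rangle \,\Big|\, P_{t_i}\right] \\
&\le \alpha(t_{i+1} - t_i)(k-1) + \mathbb{E}\left[\max_{p \in \mathcal{A}} \frac{D(p, P_{t_i})}{\eta} + \frac{\eta k (t_{i+1} - t_i)}{2} \,\Big|\, P_{t_i}\right].
\end{aligned}$$

By assumption, $P_{t_i} \in \mathcal{A}$ and so $P_{t_i j} \ge \alpha$ for all j and $D(p, P_{t_i}) \le \log(1/\alpha)$. 
Combining this observation with the previous two displays shows that 

$$R_{nm} \le n\alpha(k-1) + \frac{m\log(1/\alpha)}{\eta} + \frac{\eta n k}{2}\,.$$

The learning rate and clipping parameters are approximately optimised by 

$$\eta = \sqrt{2m\log(1/\alpha)/(nk)} \quad\text{and}\quad \alpha = \sqrt{m/(nk)}\,,$$

which leads to a regret of $R_{nm} \le \sqrt{mnk\log(nk/m)} + \sqrt{mnk}$. In typical 
applications, the value of m is not known. In this case one can choose $\eta = 
\sqrt{\log(1/\alpha)/nk}$ and $\alpha = \sqrt{1/nk}$, and the regret increases by a factor of $O(\sqrt{m})$. 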
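
The two-step update can be sketched in a few lines. The projection below is an assumption of this example rather than the solution of Exercise 26.12: it uses the form suggested by the KKT conditions of the second sub-problem, namely $p_i = \max(\alpha, c\, q_i)$ with the scaling c chosen so that p sums to one, and finds c by repeatedly clipping coordinates that fall below $\alpha$. The function names and the NumPy-based simulation are illustrative.

```python
import numpy as np

def project_clipped_simplex(q, alpha):
    """Projection of q > 0 onto {p : sum(p) = 1, p_i >= alpha} under the
    unnormalised negentropy Bregman divergence (assumed form: max(alpha, c*q))."""
    q = np.asarray(q, dtype=float)
    k = len(q)
    if alpha * k >= 1.0:                    # the clipped simplex is a single point
        return np.full(k, 1.0 / k)
    clipped = np.zeros(k, dtype=bool)
    while True:
        # Scale the unclipped coordinates so that the total mass is one.
        c = (1.0 - alpha * clipped.sum()) / q[~clipped].sum()
        newly = (~clipped) & (c * q < alpha)
        if not newly.any():
            break
        clipped |= newly
    return np.where(clipped, alpha, c * q)

def osmd_clipped(losses, eta, alpha, rng):
    """Run the update of Eq. (31.3) on an (n, k) loss array; return the total loss."""
    n, k = losses.shape
    p = np.full(k, 1.0 / k)                 # P_1 is uniform
    total = 0.0
    for t in range(n):
        a = rng.choice(k, p=p)              # A_t ~ P_t
        total += losses[t, a]
        y_hat = np.zeros(k)
        y_hat[a] = losses[t, a] / p[a]      # importance-weighted loss estimate
        p = project_clipped_simplex(p * np.exp(-eta * y_hat), alpha)
    return total

n, k, m = 1000, 2, 2
t = np.arange(1, n + 1)
losses = np.column_stack([t <= n / 2, t > n / 2]).astype(float)
alpha = np.sqrt(m / (n * k))
eta = np.sqrt(2 * m * np.log(1 / alpha) / (n * k))
print(osmd_clipped(losses, eta, alpha, np.random.default_rng(0)))
```

The tuning in the usage lines follows the approximately optimal choices stated in the proof; the loss sequence is the two-armed example from the start of the chapter.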


31.2 Stochastic Bandits 


To keep things simple, we will assume the rewards are Gaussian and that for 
each arm i there is a function $\mu_i : [n] \to \mathbb{R}$, and the reward is 

$$X_t = \mu_{A_t}(t) + \eta_t\,,$$

where $(\eta_t)_{t=1}^n$ is a sequence of independent standard Gaussian random variables. 
The optimal arm in round t has mean $\mu^*(t) = \max_{i \in [k]} \mu_i(t)$ and the regret is 

$$R_n(\mu) = \sum_{t=1}^n \mu^*(t) - \mathbb{E}\left[\sum_{t=1}^n \mu_{A_t}(t)\right].$$

The amount of non-stationarity is modelled by placing restrictions on the functions 
$\mu_i : [n] \to \mathbb{R}$. To be consistent with the previous section, we assume the mean 
vector changes at most $m - 1$ times, which amounts to saying that 

$$\sum_{t=1}^{n-1} \max_{i \in [k]} \mathbb{I}\{\mu_i(t) \ne \mu_i(t+1)\} \le m - 1\,.$$

If the locations of the change points were known then, thanks to the concavity of 
log, running a new copy of UCB on each interval would lead to a bound of 

$$R_n(\mu) = O\left(m + \frac{mk}{\Delta_{\min}} \log\left(\frac{n}{m}\right)\right), \qquad (31.4)$$

where $\Delta_{\min}$ is the smallest suboptimality gap over all m blocks and $n > m$. This 
is a non-vacuous bound for n large. Inspired by the results of the last section 
that showed that the bound achieved by an omniscient policy that knows when 
the changes occur can be achieved by a policy that does not, one then wonders 
whether the same holds concerning the bound in Eq. (31.4). As it turns out, the 
answer in this case is no. 


THEOREM 31.2. Let k = 2, and fix $\Delta \in (0,1)$ and a policy $\pi$. Let $\mu$ be so that 
$\mu_i(t) = \mu_i$ is constant for both arms and $\Delta = \mu_1 - \mu_2 > 0$. If the expected regret 
$R_n(\mu)$ of policy $\pi$ on bandit $\mu$ satisfies $R_n(\mu) = o(n)$, then for all sufficiently 
large n, there exists a non-stationary bandit $\mu'$ with at most two change points 
and $\min_{t \in [n]} |\mu'_1(t) - \mu'_2(t)| \ge \Delta$ such that $R_n(\mu') \ge n/(22 R_n(\mu))$. 


The theorem implies that if a policy enjoys $R_n(\mu) = o(n^{1/2})$ for any non-trivial 
(stationary) bandit, then its minimax regret is at least $\omega(n^{1/2})$ on some 
non-stationary bandit. In particular, if $R_n(\mu) = O(\log(n))$, then its worst-case regret 
against non-stationary bandits with at most two changes is at least $\Omega(n/\log(n))$. 
This dashes our hopes for a policy that outperforms Exp4 in a stochastic setting 
with switches, even in an asymptotic sense. The reason for the negative result 
is that any algorithm anticipating the possibility of an abrupt change must 
frequently explore all suboptimal arms to check that no change has occurred. 

There are algorithms designed for non-stationary bandits in the stochastic 
setting with abrupt change points as described above. Those that come with 
theoretical guarantees are based on forgetting or discounting data so that decisions 
of the algorithm depend almost entirely on recent data. In the notes, we discuss 
these approaches along with alternative models for non-stationarity. For now, 
the advantage of the stochastic setting seems to be that it admits algorithms 
that do not need to know the number of changes, while, as noted beforehand, 
such algorithms are not yet known (and may not be possible) in the 
nonstochastic setting. 


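Proof of Theorem 31.2 Let $(S_j)_{j=1}^L$ be a uniform partition of [n] into successive 
intervals. Let $\mathbb{P}$ and $\mathbb{E}[\cdot]$ denote the probabilities and expectations with respect 
to the bandit determined by $\mu$ and $\mathbb{P}'$ with respect to alternative non-stationary 
bandit $\mu'$ to be defined shortly. By the pigeonhole principle, there exists a $j \in [L]$ 
such that 

$$\mathbb{E}\left[\sum_{t \in S_j} \mathbb{I}\{A_t = 2\}\right] \le \frac{\mathbb{E}[T_2(n)]}{L}\,. \qquad (31.5)$$

Define an alternative non-stationary bandit with $\mu'(t) = \mu$ except for $t \in S_j$, 
when we let $\mu'_2(t) = \mu_2 + \varepsilon$, where $\varepsilon = \sqrt{2L/\mathbb{E}[T_2(n)]}$, while $\mu'_1(t) = \mu_1$. Then, 
by Theorem 14.2 and Lemma 15.1, 

$$\mathbb{P}\left(\sum_{t \in S_j} \mathbb{I}\{A_t = 2\} \ge \frac{|S_j|}{2}\right) + \mathbb{P}'\left(\sum_{t \in S_j} \mathbb{I}\{A_t = 2\} < \frac{|S_j|}{2}\right) \ge \frac{1}{2}\exp\left(-D(\mathbb{P}, \mathbb{P}')\right) \ge \frac{1}{2}\exp\left(-\frac{\mathbb{E}[T_2(n)]\varepsilon^2}{2L}\right) = \frac{1}{2e}\,.$$

By Markov’s inequality and Eq. (31.5), 

$$\mathbb{P}\left(\sum_{t \in S_j} \mathbb{I}\{A_t = 2\} \ge \frac{|S_j|}{2}\right) \le \frac{2}{|S_j|}\,\mathbb{E}\left[\sum_{t \in S_j} \mathbb{I}\{A_t = 2\}\right] \le \frac{2\,\mathbb{E}[T_2(n)]}{L|S_j|} \le \frac{1}{\Delta^2 |S_j|}\,,$$

where the last inequality follows by choosing $L = \lceil 2\Delta^2 \mathbb{E}[T_2(n)] \rceil$ and assuming n 
is large enough that $L \le n$. Then $\varepsilon \ge 2\Delta$ so that $\mu'$ satisfies the assumptions of 
the theorem. Therefore, 

$$R_n(\mu') \ge \left(\frac{1}{2e} - \frac{1}{\Delta^2 |S_j|}\right) \frac{|S_j|\Delta}{2} = \frac{|S_j|\Delta}{4e} - \frac{1}{2\Delta}\,.$$

Then, using $R_n(\mu) = \Delta\,\mathbb{E}[T_2(n)]$, the definition of L and the assumption that 
$R_n(\mu) = o(n)$, it follows that for sufficiently large n, 

$$R_n(\mu') \ge \frac{n}{22 R_n(\mu)}\,,$$

where the constant is chosen so that $1/22 < 1/(8e)$. 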


Notes 


1 Environments that appear non-stationary can often be made stationary by 
adding context. For example, when bandit algorithms are used for on-line 
advertising, gym membership advertisements are received more positively in 
January than July. A bandit algorithm that is oblivious to the time of year 
will perceive this environment as non-stationary. You could tackle this problem 
by using one of the algorithms in this chapter. Or you could use a contextual 
bandit algorithm and include the time of year in the context. The reader is 
encouraged to consider whether or not adding contextual information might 
be preferable to using an algorithm designed for non-stationary bandits. 




2 The negative results for stochastic non-stationary bandits do not mean that 
trying to improve on the adversarial bandit algorithms is completely hopeless. 
First of all, the adversarial bandit algorithms are not well suited for exploiting 
distributional assumptions on the noise, which makes things irritating when 
the losses/rewards are Gaussian (which are unbounded) or Bernoulli (which 
have small variance near the boundaries). There have been several algorithms 
designed specifically for stochastic non-stationary bandits. When the reward 
distributions are permitted to change abruptly, as in the last section, then the 
two main algorithms are based on the idea of ‘forgetting’ rewards observed in 
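the distant past. One way to do this is with discounting. Let $\gamma \in (0,1)$ be 
the discount factor, and define 

$$\hat\mu_i^\gamma(t) = \sum_{s=1}^t \gamma^{t-s}\, \mathbb{I}\{A_s = i\}\, X_s\,, \qquad T_i^\gamma(t) = \sum_{s=1}^t \gamma^{t-s}\, \mathbb{I}\{A_s = i\}\,.$$

Then, for appropriately tuned constant $\alpha$, the discounted UCB policy chooses 
each arm once and subsequently 

$$A_t = \operatorname{argmax}_{i \in [k]} \left(\frac{\hat\mu_i^\gamma(t-1)}{T_i^\gamma(t-1)} + \sqrt{\frac{\alpha}{T_i^\gamma(t-1)} \log\left(\sum_{j=1}^k T_j^\gamma(t-1)\right)}\right).$$

The idea is to ‘discount’ rewards that occurred far in the past, which makes 
the algorithm most influenced by recent events. A related algorithm called 
sliding-window UCB takes a similar approach, but rather than discounting past 
rewards with a geometric discount function, it simply discards them altogether. 
Let $\tau \in \mathbb{N}^+$ be a constant, and define 

$$\hat\mu_i^\tau(t) = \sum_{s=t-\tau+1}^t \mathbb{I}\{A_s = i\}\, X_s\,, \qquad T_i^\tau(t) = \sum_{s=t-\tau+1}^t \mathbb{I}\{A_s = i\}\,.$$

Then sliding-window UCB chooses 

$$A_t = \operatorname{argmax}_{i \in [k]} \left(\frac{\hat\mu_i^\tau(t-1)}{T_i^\tau(t-1)} + \sqrt{\frac{\alpha}{T_i^\tau(t-1)} \log(t \wedge \tau)}\right).$$

Regrettably, however, these algorithms suffer from a tuning problem. There 
is no choice of $\gamma$ and $\tau$ for which the algorithms enjoy $R_n = O(\sqrt{n\log(n)})$ in 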
a minimax sense. On the positive side, there is empirical evidence to support 
the use of these algorithms when the stochastic assumption holds. Recently, 
more complicated algorithms were proposed that can adapt to the number 


of switches in a stochastic environment and match the regret of an optimally 
tuned adversarial algorithm [Auer et al., 2019, Chen et al., 2019]. 

3 An alternative way to model non-stationarity in stochastic bandits is to assume 
the mean pay-offs of the arms are slowly drifting. One way to do this is to 
assume that $\mu_i(t)$ follows a reflected Brownian motion in some interval. It is 
not hard to see that the regret is necessarily linear in this case because the best 
arm changes in any round with constant probability. The objective in this case 
is to understand the magnitude of the linear regret in terms of the size of the 
interval or volatility of the Brownian motion. 
4 Yet another idea is to allow the means to change in an arbitrary way, but 
restrict the amount of total variation. Let $\mu_t = (\mu_1(t), \ldots, \mu_k(t))$ and 

$$V_n = \sum_{t=1}^{n-1} \|\mu_t - \mu_{t+1}\|_\infty$$

be the cumulative change in mean rewards measured in terms of the supremum 
norm. Then, for each $V \in [1/k, n/k]$, there exists a policy such that for all 
bandits with $V_n \le V$, it holds that 

$$R_n = O\left((Vk\log(k))^{1/3}\, n^{2/3}\right). \qquad (31.6)$$

This bound is nearly tight in a minimax sense. The lower bound is obtained by 
partitioning [n] into m parts, where in each part all arms have equal means 
except for the optimal arm, which is better by $\Delta = c\sqrt{mk/n}$ for universal 
constant $c > 0$. The usual argument shows that the total regret is $\Omega(\sqrt{kmn})$, 
while $V_n \le 2cm^{3/2}\sqrt{k/n}$. Tuning m so that $V_n \le V$ completes the proof. 
Recent work shows that it is possible to achieve Eq. (31.6) without knowing 
V. That is, there exists an algorithm that is able to adapt to V. In fact, 
the algorithm mentioned in Note 2, which is able to adapt to the number of 
switches, can accomplish this. 
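
The discounted and sliding-window UCB indices of Note 2 can be sketched as follows. This is a minimal simulation under stated assumptions: rewards are Gaussian with unit variance around a given (n, k) array of means, arms are indexed from zero, and in the sliding-window variant an arm with no plays left in the window is simply played next (a choice made here to keep the index well defined). Function and variable names are hypothetical.

```python
import math
from collections import deque
import numpy as np

def run_discounted_ucb(means, gamma, alpha, rng):
    """Discounted UCB; means[t, i] is the mean of arm i in round t + 1."""
    n, k = means.shape
    mu_hat = np.zeros(k)                     # discounted reward sums
    T = np.zeros(k)                          # discounted play counts
    actions = np.empty(n, dtype=int)
    for t in range(n):
        if t < k:
            a = t                            # play each arm once
        else:
            bonus = np.sqrt(alpha / T * np.log(T.sum()))
            a = int(np.argmax(mu_hat / T + bonus))
        x = means[t, a] + rng.standard_normal()
        mu_hat *= gamma                      # discount everything seen so far
        T *= gamma
        mu_hat[a] += x
        T[a] += 1.0
        actions[t] = a
    return actions

def run_sliding_window_ucb(means, tau, alpha, rng):
    """Sliding-window UCB using only the last tau rounds."""
    n, k = means.shape
    window = deque()                         # (arm, reward) pairs inside the window
    sums = np.zeros(k)
    counts = np.zeros(k)
    actions = np.empty(n, dtype=int)
    for t in range(n):
        unplayed = np.flatnonzero(counts == 0)
        if unplayed.size > 0:
            a = int(unplayed[0])             # assumption: refresh arms missing from the window
        else:
            bonus = np.sqrt(alpha / counts * math.log(min(t + 1, tau)))
            a = int(np.argmax(sums / counts + bonus))
        x = means[t, a] + rng.standard_normal()
        window.append((a, x))
        sums[a] += x
        counts[a] += 1.0
        if len(window) > tau:                # discard the round that left the window
            old_a, old_x = window.popleft()
            sums[old_a] -= old_x
            counts[old_a] -= 1.0
        actions[t] = a
    return actions

rng = np.random.default_rng(0)
means = np.zeros((400, 2))
means[:200, 0] = 1.0                         # arm 0 is best in the first half
means[200:, 1] = 1.0                         # arm 1 is best in the second half
print(run_discounted_ucb(means, 0.98, 0.5, rng)[-10:])
print(run_sliding_window_ucb(means, 80, 0.5, rng)[-10:])
```

The values of gamma, tau and alpha in the usage lines are arbitrary and only illustrate the interface; as the note explains, no single tuning works well for every number of changes.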


Bibliographic Remarks 


Non-stationary bandits have quite a long history. The celebrated Gittins index is 
based on a model where each arm is associated with a Markov chain that evolves 
when played, the reward depends on the state, and the state of the chosen Markov 
chain is observed after it evolves [Gittins, 1979, Gittins et al., 2011]. The classical 
approaches, as discussed in Chapter 35, address this problem in the Bayesian 
framework, and the objective is primarily to design efficient algorithms rather 
than understanding the frequentist regret. Even more related is the restless 
bandit, which is the same as Gittins’s set-up except the Markov chain for every 
arm evolves in every round, while the learner still only observes the state and 
reward for the action they chose. As a result, the learner needs to reason about the 
evolution of all the Markov chains, which makes this problem rather challenging. 
Restless bandits were introduced by Whittle [1988] in the Bayesian framework, 
where most of the results are not especially positive. There has been some interest 
in a frequentist analysis, but the challenging nature of the problem makes it 
difficult to design efficient algorithms with meaningful regret guarantees [Ortner 
et al., 2012]. Certainly there is potential for more work in this area. 

The ideas in Section 31.1 are mostly generalisations of algorithms designed 
for the full information setting, notably the fixed share algorithm [Herbster and 
Warmuth, 1998]. The first algorithm designed for the adversarial non-stationary 
bandit is Exp3.S by Auer et al. [2002b]. This algorithm can be interpreted as an 
efficient version of Exp4, where experts correspond to sequences of actions that 
have the permitted number of changes and where the initialisation is carefully 
chosen so that the computation needed to run Exp4 is made tractable [György 
et al., 2019]. See also the analysis of fixed share in the book by Cesa-Bianchi 
and Lugosi [2006]. The Exp3.P policy was originally developed in order to prove 
high-probability bounds for finite-armed adversarial bandits [Auer et al., 2002b], 
but Audibert and Bubeck [2010b] proved that with appropriate tuning it also 
enjoys the same bounds as Exp3.S. Presumably this also holds for Exp3-IX. 
Mirror descent has been used to prove tracking bounds in the full information 
setting by Herbster and Warmuth [2001]. A more recent reference is by György 
and Szepesvari [2016], which makes the justification for clipping explicit. The 
latter paper considers the linear prediction setting and provides bounds on the 
regret that scale with the complexity of the sequence of losses as measured by 
the cumulative change of consecutive loss vectors. The advantage of this is that 
the complexity measure can distinguish between abrupt and gradual changes. 
This is similar to the approach of Besbes et al. [2014]. The lower bound for 
stochastic non-stationary bandits is by Garivier and Moulines [2011], though 
our proof differs in minor ways. We mentioned that there is a line of work on 
stochastic non-stationary bandits where the rewards are slowly drifting. The 
approach based on Brownian motion is due to Slivkins and Upfal [2008], while 
the variant described in Note 4 is by Besbes et al. [2014], who also gave the lower 
bound described there. The idea of discounted UCB was introduced without 
analysis by Kocsis and Szepesvari [2006]. The analysis of this algorithm and 
also of sliding-window UCB algorithm is by Garivier and Moulines [2011]. The 
sliding-window algorithm has been extended to linear bandits [Cheung et al., 
2019] and learning in Markov decision processes [Gajane et al., 2018]. Contextual 
bandits have also been studied in the non-stationary setting [Luo et al., 2018, 
Chen et al., 2019]. We are not aware of an algorithm for the adversarial setting 
with $R_n = O(\sqrt{mkn\log(n)})$ when the number of switches is unknown. Auer 
et al. [2018] prove a bound of $R_n = O(\sqrt{mkn\log(n)})$ in the stochastic setting 
when k = 2. The idea underlying this work has been extended to the k-armed 
case [Auer et al., 2019], as well as to the contextual case [Chen et al., 2019], 
the latter of which also shows that adapting to the total shift of distributions 
described in Note 4 is possible. The key novelty in these algorithms is adding 
explicit exploration whose durations are multi-scale, which is made possible by 
extra randomisation. 


Exercises 
31.1 (EXP4 FOR NON-STATIONARY BANDITS) Let $n, m, k \in \mathbb{N}^+$. Prove (31.2). 
In particular, specify first what the experts predict in each round and how 
Theorem 18.1 gives rise to (31.1) and how (31.2) follows from (31.1). 

Hint For the second part, you may find it useful to show the following well-known 
inequality: for $0 \le m \le n$, defining $\Phi_m(n) = \sum_{i=0}^m \binom{n}{i}$, it holds that 
$(m/n)^m\, \Phi_m(n) \le e^m$. 


31.2 (LOWER BOUND FOR ADVERSARIAL NON-STATIONARY BANDITS) Let 
$n, m, k \in \mathbb{N}^+$ be such that $n \ge mk$. Prove that for any policy $\pi$ there exists an 
adversarial bandit $(y_{ti})$ such that 

$$R_{nm} \ge c\sqrt{nmk}\,,$$

where $c > 0$ is a universal constant. 


31.3 (UNSUITABILITY OF EXP3 FOR NON-STATIONARY BANDITS) Prove for all 
sufficiently large n that Exp3 from Chapter 11 has $R_{n2} \ge cn$ for some universal 
constant $c > 0$. 


31.4 (EMPIRICAL COMPARISON) Let k = 2 and n = 1000, and define an adversarial 
bandit in terms of losses with $y_{t1} = \mathbb{I}\{t \le n/2\}$ and $y_{t2} = \mathbb{I}\{t > n/2\}$. Plot the 
expected regret of Exp3, Exp3-IX and the variant of online stochastic mirror 
descent proposed in this chapter. Experiment with a number of learning rates for 
each algorithm. 


Ranking 


Ranking is the process of producing an ordered shortlist of m items from a 
larger collection of $\ell$ items. These tasks come in several flavours. Sometimes the 
user supplies a query, and the system responds with a shortlist of items. In other 
applications the shortlist is produced without an explicit query. For example, a 
streaming service might provide a list of recommended movies when you sign in. 
Our focus here is on the second type of problem. 

We examine a sequential version of the ranking problem where the learner 
selects a ranking, receives feedback about its quality and repeats the process 
over n rounds. The feedback will be in the form of ‘clicks’ from the user, which 
comes from the view that ranking is a common application in on-line 
recommendation systems and the user selects the items they like by clicking on 
them. The objective of the learner is to maximise the expected number of clicks. 


Figure 32.1 A classic ranking problem: which hats to put where on the stand? 
Higher and towards the front attracts more attention. 


Ranking is a huge topic, and our approach is necessarily quite narrow. In fact 
there is still a long way to go before we have a genuinely practical algorithm 
for large-scale online ranking problems. As usual, we summarise alternative 
ideas in the notes. 


Stochastic Ranking 


A permutation on $[\ell]$ is an invertible function $\sigma : [\ell] \to [\ell]$. Let $\mathcal{A}$ be the set of 
all permutations on $[\ell]$. In each round t the learner chooses an action $A_t \in \mathcal{A}$, 
which should be interpreted as meaning the learner places item $A_t(k)$ in the kth 
position. Equivalently, $A_t^{-1}(i)$ is the position of the ith item. Since the shortlist 
has length m, the order of $A_t(m+1), \ldots, A_t(\ell)$ is not important and is included 
only for notational convenience. After choosing their action, the learner observes 
$C_{ti} \in \{0,1\}$ for each $i \in [\ell]$, where $C_{ti} = 1$ if the user clicked on the ith item. 
Note that the user may click on multiple items. We will assume a stochastic 
model where the probability that the user clicks on position k in round t only 
depends on $A_t$ and is given by $v(A_t, k)$, with $v : \mathcal{A} \times [\ell] \to [0,1]$ an unknown 
function. The regret over n rounds is 

$$R_n = n \max_{a \in \mathcal{A}} \sum_{k=1}^{\ell} v(a, k) - \mathbb{E}\left[\sum_{t=1}^n \sum_{i=1}^{\ell} C_{ti}\right].$$


A naive way to minimise the regret would be to create a finite-armed bandit 
where each arm corresponds to a ranking of the items and then apply your 
favourite algorithm from Part II. The problem is that these algorithms treat the 
arms as independent and cannot exploit any structure in the ranking. This is 
almost always unacceptable because the number of ways to rank m items from a 
collection of size $\ell$ is $\ell!/(\ell - m)!$. Ranking illustrates one of the most fundamental 
dilemmas in machine learning: choosing a model. A rich model leads to low 
misspecification error, but takes longer to fit. A coarse model can suffer from 
large misspecification error. In the context of ranking, a model corresponds to 
assumptions on the function v. 


Click Models 


The only way to avoid the curse of dimensionality is to make assumptions. A 
natural way to do this for ranking is to assume that the probability of clicking on 
an item depends on (a) the underlying quality of that item and (b) the location of 
that item in the chosen ranking. A formal definition of how this is done is called 
a click model. Deciding which model to use depends on the particulars of the 
problem at hand, such as how the list is presented to the user and whether or not 
clicking on an item diverts them to a different page. This issue has been studied 
by the data retrieval community, and there is now a large literature devoted 
to the pros and cons of different choices. We limit ourselves to describing the 
popular choices and give pointers to the literature at the end of the chapter. 


Document-Based Model 

The document-based model is one of the simplest click models, which assumes 
the probability of clicking on a shortlisted item is equal to its attractiveness. 
Formally, for each item $i \in [\ell]$, let $\alpha(i) \in [0,1]$ be the attractiveness of item i. 
The document-based model assumes that 

$$v(a, k) = \alpha(a(k))\, \mathbb{I}\{k \le m\}\,.$$

The unknown quantity in this model is the attractiveness function, which has 
just $\ell$ parameters. 


Position-Based Model 

The document-based model might occasionally be justified, but in most cases the 
position of an item in the ranking also affects the likelihood of a click. A natural 
extension that accounts for this behaviour is called the position-based model, 
which assumes that 

$$v(a, k) = \alpha(a(k))\, \chi(k)\,,$$

where $\chi : [\ell] \to [0,1]$ is a function that measures the quality of position k. Since 
the user cannot click on items that are not shown, we assume that $\chi(k) = 0$ for 
$k > m$. This model is richer than the document-based model, which is recovered 
by choosing $\chi(k) = \mathbb{I}\{k \le m\}$. The number of parameters in the position-based 
model is $m + \ell$. 


Cascade Model 

The position-based model is not suitable for applications where clicking on an 
item takes the user to a different page. In the cascade model, it is assumed 
that the user scans the shortlisted items in order and only clicks on the first 
item they find attractive. Define $\chi : \mathcal{A} \times [\ell] \to [0,1]$ by 

$$\chi(a, k) = \begin{cases} 1 & \text{if } k = 1 \\ 0 & \text{if } k > m \\ \prod_{k'=1}^{k-1} \left(1 - \alpha(a(k'))\right) & \text{otherwise,} \end{cases}$$

which is the probability that the user has not clicked on the first $k - 1$ items. 
Then the cascade model assumes that 

$$v(a, k) = \alpha(a(k))\, \chi(a, k)\,. \qquad (32.1)$$

The first term in the factorisation is the attractiveness function, which measures 
the probability that the user is attracted to the ith item. The second term can be 
interpreted as the probability that the user examines that item. This interpretation 
is also valid in the position-based model. It is important to emphasise that 
$v(a, k)$ is the probability of clicking on the kth position when taking action 
$a \in \mathcal{A}$. This does not mean that $C_{t1}, \ldots, C_{t\ell}$ are independent. The assumptions 
only restrict the marginal distribution of each $C_{ti}$, which is sufficient for our 
purposes. Nevertheless, in the cascade model, it would be standard to assume 
that $C_{tA_t(k)} = 0$ if there exists a $k' < k$ such that $C_{tA_t(k')} = 1$, and otherwise 

$$\mathbb{P}\left(C_{tA_t(k)} = 1 \mid A_t, C_{tA_t(1)} = 0, \ldots, C_{tA_t(k-1)} = 0\right) = \mathbb{I}\{k \le m\}\, \alpha(A_t(k))\,.$$

Like the document-based model, the cascade model has $\ell$ parameters. 
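
The three click models above differ only in the examination term, which the following sketch makes explicit. It is an illustration under assumptions of its own: a ranking is stored as a Python list whose entry k − 1 is the item in position k, attractiveness is a dictionary from items to probabilities, and the function names are hypothetical.

```python
import math

def v_document(alpha, a, k, m):
    """Document-based model: click probability equals attractiveness if shortlisted."""
    return alpha[a[k - 1]] if k <= m else 0.0

def v_position(alpha, chi, a, k):
    """Position-based model: attractiveness times position quality chi[k - 1]."""
    return alpha[a[k - 1]] * chi[k - 1]

def v_cascade(alpha, a, k, m):
    """Cascade model: the item in position k is examined only if no earlier item attracted."""
    if k > m:
        return 0.0
    examine = math.prod(1.0 - alpha[a[kp - 1]] for kp in range(1, k))
    return alpha[a[k - 1]] * examine

alpha = {1: 0.5, 2: 0.5, 3: 0.2}
a, m = [1, 2, 3], 2
chi = [1.0, 1.0, 0.0]                      # position-based model reducing to document-based
for k in (1, 2, 3):
    print(k, v_document(alpha, a, k, m), v_position(alpha, chi, a, k), v_cascade(alpha, a, k, m))
```

With position quality equal to the indicator of being shortlisted, the position-based values coincide with the document-based ones, while the cascade model halves the click probability at the second position because the first item attracts the user half of the time.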


Generic Model 

We now introduce a model that generalises the last three. Previous models 
essentially assumed that the probability of a click factorises into an attractiveness 
probability and an examination probability. We deviate from this norm by making 
assumptions directly on the function v. Given $\alpha : [\ell] \to [0,1]$, an action a is 
called $\alpha$-optimal if the shortlisted items are the m most attractive sorted by 
attractiveness: $\alpha(a(k)) = \max_{k' \ge k} \alpha(a(k'))$ for all $k \in [m]$. 


[Figure: two rankings of five positions side by side; items i and j occupy 
positions 2 and 4 in the left list and are exchanged in the right list.] 

Figure 32.2 Part (c) of Assumption 32.1 says that the probability of clicking in the 
second position on the left list is larger than the probability of clicking on the second 
position on the right list by a factor of $\alpha(i)/\alpha(j)$. For the fourth position, the probability 
is larger for the right list than the left by the same factor. 


ASSUMPTION 32.1. There exists an attractiveness function $\alpha : [\ell] \to [0,1]$ such 
that the following four conditions are satisfied. Let $a \in \mathcal{A}$ and $i, j, k \in [\ell]$ be such 
that $\alpha(i) \ge \alpha(j)$, and let $\sigma$ be the permutation that exchanges i and j. 


(a) $v(a, k) = 0$ for all $k > m$. 

(b) $\sum_{k=1}^m v(a^*, k) = \max_{a \in \mathcal{A}} \sum_{k=1}^m v(a, k)$ for all $\alpha$-optimal actions $a^*$. 

(c) For all i and j with $\alpha(i) \ge \alpha(j)$, 

$$v(a, a^{-1}(i)) \ge \frac{\alpha(i)}{\alpha(j)}\, v(\sigma \circ a, a^{-1}(i))\,,$$

where $\sigma$ is the permutation on $[\ell]$ that exchanges i and j. 

(d) If a is an action such that $\alpha(a(k)) = \alpha(a^*(k))$ for some $\alpha$-optimal action $a^*$, 
then $v(a, k) \ge v(a^*, k)$. 


These assumptions may appear quite mysterious. At some level they are 
chosen to make the proof go through, while simultaneously generalising the 
document-based, position-based and cascade models (32.1). The choices are 
not entirely without basis or intuition, however. Part (a) asserts that the user 
does not click on items that are not placed in the shortlist. Part (b) says that 
$\alpha$-optimal actions maximise the expected number of clicks. Note that there 
are multiple optimal rankings if $\alpha$ is not injective. Part (c) is a little more 
restrictive and is illustrated in Fig. 32.2. One way to justify this is to assume 
that $v(a, k) = \alpha(a(k))\chi(a, k)$, where $\chi(a, k)$ is viewed as the probability that the 
user examines position k. It seems reasonable to assume that the probability 
the user examines position k should only depend on the first $k - 1$ items. Hence, 
writing $a' = \sigma \circ a$, we have $v(a, 2) = \alpha(i)\chi(a, 2) = \alpha(i)\chi(a', 2) = (\alpha(i)/\alpha(j))\, v(a', 2)$. 
In order to make the argument for the fourth position, we need to assume that 
placing less attractive items in the early slots increases the probability that the 
user examines later positions (searching for a good result). This is true for the 
position-based and cascade models, but is perhaps the most easily criticised 
assumption. Part (d) says that the probability that a user clicks on a position 
with a correctly placed item is at least as large as the probability that the user 
clicks on that position in an optimal ranking. The justification is that the items 
$a(1), \ldots, a(k-1)$ cannot be more attractive than $a^*(1), \ldots, a^*(k-1)$, which should 
increase the likelihood that the user makes it to the kth position. 

The generic model has many parameters, but we will see that the learner does 
not need to learn all of them in order to suffer small regret. The advantage of 
this model relative to the previous ones is that it offers more flexibility, and yet 
it is not so flexible that learning is impossible. 


Policy 


We now explain the policy for learning to rank when v is unknown, but satisfies 
Assumption 32.1. After the description is an illustration that may prove helpful. 


Step 0: Initialisation 

The policy takes as input a confidence parameter $\delta \in (0,1)$ and $\ell$ and m. The 
policy maintains a binary relation $G_t \subseteq [\ell] \times [\ell]$. In the first round t = 1 the 
relation is empty: $G_1 = \emptyset$. You should think of $G_t$ as maintaining pairs (i, j) 
for which the policy has proven with high probability that $\alpha(i) < \alpha(j)$. Ideally, 
$G_t \subseteq \{(i, j) \in [\ell] \times [\ell] : \alpha(i) < \alpha(j)\}$. 


Step 1: Defining a Partition 

In each round t, the learner computes a partition of the actions based on a 
topological sort according to relation $G_t$. Given $A \subseteq [\ell]$, define $\min_{G_t}(A)$ to be 
the set of minimum elements of A according to relation $G_t$: 

$$\min\nolimits_{G_t}(A) = \{i \in A : (i, j) \notin G_t \text{ for all } j \in A\}\,.$$

Then let $P_{t1}, P_{t2}, \ldots$ be the partition of $[\ell]$ defined inductively by 

$$P_{td} = \min\nolimits_{G_t}\left([\ell] \setminus \bigcup_{c=1}^{d-1} P_{tc}\right).$$

Finally, let $M_t = \max\{d : P_{td} \ne \emptyset\}$. The reader should check that if $G_t$ does not 
have cycles, then $M_t$ is well defined and finite and that $P_{t1}, \ldots, P_{tM_t}$ is indeed a 
partition of $[\ell]$ (Exercise 32.5). The event that $G_t$ contains cycles is a failure event. 
In order for the policy to be well defined, we assume it chooses some arbitrary 
fixed action in this case. 


Step 2: Choosing an Action 
Let $I_{t1}, \ldots, I_{tM_t}$ be a partition of $[\ell]$ defined inductively by 

$$I_{td} = \left[\left|\bigcup\nolimits_{c \le d} P_{tc}\right|\right] \setminus \left[\left|\bigcup\nolimits_{c < d} P_{tc}\right|\right].$$

Next let $\Sigma_t \subseteq \mathcal{A}$ be the set of actions $\sigma$ such that $\sigma(I_{td}) = P_{td}$ for all $d \in [M_t]$. 
The algorithm chooses $A_t$ uniformly at random from $\Sigma_t$. Intuitively the policy 
first shuffles the items in $P_{t1}$ and uses these as the first $|P_{t1}|$ entries in the ranking. 
Then $P_{t2}$ is shuffled, and the items are appended to the ranking. This process is 
repeated until the ranking is complete. For an item $i \in [\ell]$, we denote by $D_{ti}$ the 
unique index d such that $i \in P_{td}$. 
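
Steps 1 and 2 can be sketched directly from the definitions. The code below is a minimal illustration, assuming the relation is stored as a Python set of pairs over items 1, ..., ℓ; the function names are hypothetical, and returning None when no minimum element exists is this sketch's way of signalling the failure event (a cycle in the relation).

```python
import random

def partition(G, ell):
    """Blocks P_t1, P_t2, ... of items 1..ell induced by relation G (set of pairs).
    Returns None if G contains a cycle among the items."""
    remaining = set(range(1, ell + 1))
    blocks = []
    while remaining:
        # Minimum elements: items i with no pair (i, j) in G for j still remaining.
        block = {i for i in remaining
                 if not any((i, j) in G for j in remaining)}
        if not block:
            return None
        blocks.append(block)
        remaining -= block
    return blocks

def sample_ranking(blocks, rng):
    """Shuffle each block and concatenate: a uniform draw from the allowed rankings."""
    ranking = []
    for block in blocks:
        items = sorted(block)
        rng.shuffle(items)
        ranking.extend(items)
    return ranking

G = {(3, 1), (5, 2), (5, 3)}
blocks = partition(G, 5)
print(blocks)
print(sample_ranking(blocks, random.Random(0)))
```

For the relation in the usage lines, items 1, 2 and 4 have nothing ranked above them and form the first block, item 3 forms the second, and item 5 the third, so every sampled ranking places item 3 fourth and item 5 last.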


Step 3: Updating the Relation 
For any pair of items $i, j \in [\ell]$, define $S_{tij} = \sum_{s=1}^t U_{sij}$ and $N_{tij} = \sum_{s=1}^t |U_{sij}|$,
where

$$U_{tij} = \mathbb{I}\{D_{ti} = D_{tj}\}\,(C_{ti} - C_{tj}).$$

All this means is that $S_{tij}$ tracks the difference between the number of clicks
on items $i$ and $j$ over rounds when they share a block of the partition. As a final step, the
relation $G_{t+1}$ is given by

$$G_{t+1} = G_t \cup \left\{(j, i) : S_{tij} \geq \sqrt{2 N_{tij} \log\left(\frac{c\sqrt{N_{tij}}}{\delta}\right)}\right\},$$

where $c \approx 3.43$ is the universal constant given in Exercise 20.10. In the analysis we
will show that if $\alpha(i) \geq \alpha(j)$, then with high probability $S_{tji}$ is never large enough
for $G_{t+1}$ to include $(i, j)$. In this sense, with high probability, $G_t$ is consistent
with the order on $[\ell]$ induced by sorting in decreasing order with respect to $\alpha(\cdot)$.
Note that $G_t$ is generally not a partial order because it need not be transitive.
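A minimal sketch of the update in Step 3 follows. The data layout (dictionaries keyed by ordered pairs, a `block_of` map playing the role of $D_{ti}$) and the name `update_relation` are assumptions for the example. Pairs with $N_{tij} = 0$ are skipped, which is one way to read the threshold when no informative round has occurred yet.

```python
import math

C_CONST = 3.43  # approximate value of the universal constant quoted in the text


def update_relation(relation, S, N, block_of, clicks, delta):
    """Process one round: update S and N in place and return the new relation.

    block_of[i] is the block index of item i in this round, clicks[i] is 0 or 1.
    """
    items = list(block_of)
    for i in items:
        for j in items:
            if i == j or block_of[i] != block_of[j]:
                continue  # U_{tij} = 0 when the items are in different blocks
            u = clicks[i] - clicks[j]
            S[(i, j)] = S.get((i, j), 0) + u
            N[(i, j)] = N.get((i, j), 0) + abs(u)
    new_relation = set(relation)
    for (i, j), n_ij in N.items():
        if n_ij > 0:
            threshold = math.sqrt(
                2 * n_ij * math.log(C_CONST * math.sqrt(n_ij) / delta))
            if S[(i, j)] >= threshold:
                new_relation.add((j, i))  # evidence that alpha(j) < alpha(i)
    return new_relation
```

If item 1 is clicked and item 2 is not in every round in which they share a block, the pair `(2, 1)` enters the relation after enough rounds, while `(1, 2)` never does.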


Illustration 

Suppose $\ell = 5$ and $m = 4$, and in round $t$ the relation is $G_t = \{(3, 1), (5, 2), (5, 3)\}$,
which is represented in the graph below, where an arrow from $j$ to $i$ indicates
that $(j,i) \in G_t$.

[Figure: graph of $G_t$ with arrows from 3 to 1, from 5 to 2 and from 5 to 3, shown next to the blocks and slots listed here.]

$\mathcal{P}_{t1} = \{1,2,4\}$, $\mathcal{I}_{t1} = \{1,2,3\}$
$\mathcal{P}_{t2} = \{3\}$, $\mathcal{I}_{t2} = \{4\}$
$\mathcal{P}_{t3} = \{5\}$, $\mathcal{I}_{t3} = \{5\}$

This means that in round $t$ the first three positions in the ranking will contain
items from $\mathcal{P}_{t1} = \{1,2,4\}$ but with random order. The fourth position will be
item 3, and item 5 is not shown to the user.

Part (a) of Assumption 32.1 means that items in position $k > m$ are never
clicked. As a consequence, the algorithm never needs to actually compute
the partitions $\mathcal{P}_{td}$ for which $\min \mathcal{I}_{td} > m$ because items in these partitions
are never shortlisted.


Regret Analysis 


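THEOREM 32.2. Let $v$ satisfy Assumption 32.1, and assume that $\alpha(1) > \alpha(2) > \cdots > \alpha(\ell)$.
Let $\Delta_{ij} = \alpha(i) - \alpha(j)$ and $\delta \in (0,1)$. Then the regret of TopRank is
bounded by

$$R_n \leq \delta n m \ell^2 + \sum_{j=1}^{\ell} \sum_{i=1}^{\min\{m, j-1\}} \left(1 + \frac{6(\alpha(i) + \alpha(j)) \log\left(\frac{c\sqrt{n}}{\delta}\right)}{\Delta_{ij}}\right).$$

Furthermore, $R_n \leq \delta n m \ell^2 + m\ell + \sqrt{4 m^3 \ell n \log\left(\frac{c\sqrt{n}}{\delta}\right)}$.


By choosing $\delta = n^{-1}$ the theorem shows that the expected regret is at most

$$R_n = O\left(\sum_{j=1}^{\ell} \sum_{i=1}^{\min\{m, j-1\}} \frac{\alpha(i) \log(n)}{\Delta_{ij}}\right) \quad \text{and} \quad R_n = O\left(\sqrt{m^3 \ell n \log(n)}\right).$$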


The algorithm does not make use of any assumed ordering on $\alpha(\cdot)$, so the
assumption is only used to allow for a simple expression for the regret. The core
idea of the proof is to show that (a) if the algorithm is suffering regret as a
consequence of misplacing an item, then it is gaining information so that $G_t$ will
get larger and, (b) once $G_t$ is sufficiently rich, the algorithm is playing optimally.
Let $\mathcal{F}_t = \sigma(A_1, C_1, \ldots, A_t, C_t)$ and $\mathbb{P}_t(\cdot) = \mathbb{P}(\cdot \mid \mathcal{F}_t)$ and $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$. For each
$t \in [n]$, let $F_t$ be the failure event that there exists $i \neq j \in [\ell]$ and $s < t$ such that
$N_{sij} > 0$ and

$$\left|S_{sij} - \sum_{u=1}^{s} \mathbb{E}_{u-1}\left[U_{uij} \mid U_{uij} \neq 0\right] |U_{uij}|\right| \geq \sqrt{2 N_{sij} \log\left(c\sqrt{N_{sij}}/\delta\right)}.$$

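LEMMA 32.3. Let $i$ and $j$ satisfy $\alpha(i) \geq \alpha(j)$ and $d \geq 1$. On the event that
$i, j \in \mathcal{P}_{sd}$ and $d \in [M_s]$ and $U_{sij} \neq 0$, the following hold almost surely:

(a) $\mathbb{E}_{s-1}[U_{sij} \mid U_{sij} \neq 0] \geq \dfrac{\Delta_{ij}}{\alpha(i) + \alpha(j)}$.
(b) $\mathbb{E}_{s-1}[U_{sji} \mid U_{sji} \neq 0] \leq 0$.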


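Proof For the remainder of the proof, we focus on the event that $i, j \in \mathcal{P}_{sd}$ and
$d \in [M_s]$ and $U_{sij} \neq 0$. We also discard the measure zero subset of this event where
$\mathbb{P}_{s-1}(U_{sij} \neq 0) = 0$. From now on, we omit the 'almost surely' qualification on
conditional expectations. Under these circumstances, the definition of conditional
expectation shows that

$$\begin{aligned}
\mathbb{E}_{s-1}[U_{sij} \mid U_{sij} \neq 0]
&= \frac{\mathbb{P}_{s-1}(C_{si} = 1, C_{sj} = 0) - \mathbb{P}_{s-1}(C_{si} = 0, C_{sj} = 1)}{\mathbb{P}_{s-1}(C_{si} \neq C_{sj})} \\
&= \frac{\mathbb{P}_{s-1}(C_{si} = 1) - \mathbb{P}_{s-1}(C_{sj} = 1)}{\mathbb{P}_{s-1}(C_{si} \neq C_{sj})} \\
&\geq \frac{\mathbb{P}_{s-1}(C_{si} = 1) - \mathbb{P}_{s-1}(C_{sj} = 1)}{\mathbb{P}_{s-1}(C_{si} = 1) + \mathbb{P}_{s-1}(C_{sj} = 1)} \\
&= \frac{\mathbb{E}_{s-1}\left[v(A_s, A_s^{-1}(i)) - v(A_s, A_s^{-1}(j))\right]}{\mathbb{E}_{s-1}\left[v(A_s, A_s^{-1}(i)) + v(A_s, A_s^{-1}(j))\right]},
\end{aligned} \tag{32.2}$$

where in the second equality we added and subtracted $\mathbb{P}_{s-1}(C_{si} = 1, C_{sj} = 1)$.
By the design of TopRank, the items in $\mathcal{P}_{sd}$ are placed into slots $\mathcal{I}_{sd}$ uniformly at
random. Let $\sigma$ be the permutation that exchanges the positions of items $i$ and $j$.
Then using Part (c) of Assumption 32.1,

$$\begin{aligned}
\mathbb{E}_{s-1}\left[v(A_s, A_s^{-1}(i))\right]
&= \sum_{a \in \mathcal{A}} \mathbb{P}_{s-1}(A_s = a)\, v(a, a^{-1}(i)) \\
&\geq \frac{\alpha(i)}{\alpha(j)} \sum_{a \in \mathcal{A}} \mathbb{P}_{s-1}(A_s = a)\, v(\sigma \circ a, a^{-1}(i)) \\
&= \frac{\alpha(i)}{\alpha(j)} \sum_{a \in \mathcal{A}} \mathbb{P}_{s-1}(A_s = \sigma \circ a)\, v(\sigma \circ a, (\sigma \circ a)^{-1}(j)) \\
&= \frac{\alpha(i)}{\alpha(j)}\, \mathbb{E}_{s-1}\left[v(A_s, A_s^{-1}(j))\right],
\end{aligned}$$

where the second equality follows from the fact that $a^{-1}(i) = (\sigma \circ a)^{-1}(j)$ and the
definition of the algorithm ensuring that $\mathbb{P}_{s-1}(A_s = a) = \mathbb{P}_{s-1}(A_s = \sigma \circ a)$. The
last equality follows from the fact that $\sigma$ is a bijection. Using this and continuing
the calculation in Eq. (32.2) shows that

$$\begin{aligned}
\text{Eq. (32.2)}
&= \frac{\mathbb{E}_{s-1}\left[v(A_s, A_s^{-1}(i)) - v(A_s, A_s^{-1}(j))\right]}{\mathbb{E}_{s-1}\left[v(A_s, A_s^{-1}(i)) + v(A_s, A_s^{-1}(j))\right]} \\
&= 1 - \frac{2}{1 + \mathbb{E}_{s-1}\left[v(A_s, A_s^{-1}(i))\right] / \mathbb{E}_{s-1}\left[v(A_s, A_s^{-1}(j))\right]} \\
&\geq 1 - \frac{2}{1 + \alpha(i)/\alpha(j)}
= \frac{\alpha(i) - \alpha(j)}{\alpha(i) + \alpha(j)} = \frac{\Delta_{ij}}{\alpha(i) + \alpha(j)}.
\end{aligned}$$

The second part follows from the first since $U_{sji} = -U_{sij}$.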


The next lemma shows that the failure event occurs with low probability. 


LEMMA 32.4. It holds that $\mathbb{P}(F_n) \leq \delta \ell^2$.

Proof The proof follows immediately from Lemma 32.3, the definition of $F_n$, the
union bound over all pairs of actions, and a modification of the Azuma-Hoeffding
inequality in Exercise 20.10.


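LEMMA 32.5. On the event $F_t^c$, it holds that $(i, j) \notin G_t$ for all $i < j$.


Proof Let $i < j$ so that $\alpha(i) \geq \alpha(j)$. On the event $F_t^c$, either $N_{sji} = 0$ or

$$\left|S_{sji} - \sum_{u=1}^{s} \mathbb{E}_{u-1}\left[U_{uji} \mid U_{uji} \neq 0\right] |U_{uji}|\right| < \sqrt{2 N_{sji} \log\left(\frac{c}{\delta}\sqrt{N_{sji}}\right)} \quad \text{for all } s < t.$$

When $i$ and $j$ are in different blocks in round $u < t$, then $U_{uji} = 0$ by definition.
On the other hand, when $i$ and $j$ are in the same block, $\mathbb{E}_{u-1}[U_{uji} \mid U_{uji} \neq 0] \leq 0$
almost surely by Lemma 32.3. Based on these observations,

$$S_{sji} < \sqrt{2 N_{sji} \log\left(\frac{c}{\delta}\sqrt{N_{sji}}\right)} \quad \text{for all } s < t,$$

which by the design of TopRank implies that $(i, j) \notin G_t$.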


LEMMA 32.6. Let $I^*_{td} = \min \mathcal{P}_{td}$ be the most attractive item in $\mathcal{P}_{td}$. Then, on
event $F_t^c$, it holds that $I^*_{td} \leq 1 + \sum_{c<d} |\mathcal{P}_{tc}|$ for all $d \in [M_t]$.


Proof Let $i^* = \min \bigcup_{c \geq d} \mathcal{P}_{tc}$. Then $i^* \leq 1 + \sum_{c<d} |\mathcal{P}_{tc}|$ holds trivially for any
$\mathcal{P}_{t1}, \ldots, \mathcal{P}_{tM_t}$ and $d \in [M_t]$. Now consider two cases. Suppose that $i^* \in \mathcal{P}_{td}$. Then
it must be true that $i^* = I^*_{td}$, and our claim holds. On the other hand, suppose
that $i^* \in \mathcal{P}_{tc}$ for some $c > d$. Then by Lemma 32.5 and the design of the partition,
there must exist a sequence of items $i_d, \ldots, i_c$ in blocks $\mathcal{P}_{td}, \ldots, \mathcal{P}_{tc}$ such that
$i_d < \cdots < i_c = i^*$. From the definition of $I^*_{td}$, $I^*_{td} \leq i_d < i^*$. This concludes our
proof.


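LEMMA 32.7. On the event $F_n^c$ and for all $i < j$, it holds that

$$S_{nij} \leq 1 + \frac{6(\alpha(i) + \alpha(j))}{\Delta_{ij}} \log\left(\frac{c\sqrt{n}}{\delta}\right).$$


Proof The result is trivial when $N_{nij} = 0$. Assume from now on that $N_{nij} > 0$.
By the definition of the algorithm, arms $i$ and $j$ are not in the same block once
$S_{tij}$ grows too large relative to $N_{tij}$, which means that

$$S_{nij} \leq 1 + \sqrt{2 N_{nij} \log\left(\frac{c}{\delta}\sqrt{N_{nij}}\right)}.$$

On the event $F_n^c$ and part (a) of Lemma 32.3, it also follows that

$$S_{nij} \geq \frac{\Delta_{ij} N_{nij}}{\alpha(i) + \alpha(j)} - \sqrt{2 N_{nij} \log\left(\frac{c}{\delta}\sqrt{N_{nij}}\right)}.$$

Combining the previous two displays shows that

$$\begin{aligned}
\frac{\Delta_{ij} N_{nij}}{\alpha(i) + \alpha(j)} - \sqrt{2 N_{nij} \log\left(\frac{c}{\delta}\sqrt{N_{nij}}\right)}
&\leq S_{nij} \leq 1 + \sqrt{2 N_{nij} \log\left(\frac{c}{\delta}\sqrt{N_{nij}}\right)} \\
&\leq (1 + \sqrt{2}) \sqrt{N_{nij} \log\left(\frac{c}{\delta}\sqrt{N_{nij}}\right)}.
\end{aligned} \tag{32.3}$$

Using the fact that $N_{nij} \leq n$ and rearranging the terms in the previous display
shows that

$$N_{nij} \leq \frac{(1 + 2\sqrt{2})^2 (\alpha(i) + \alpha(j))^2}{\Delta_{ij}^2} \log\left(\frac{c\sqrt{n}}{\delta}\right).$$

The result is completed by substituting this into Eq. (32.3).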


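Proof of Theorem 32.2 The first step in the proof is an upper bound on the
expected number of clicks in the optimal list $a^*$. Fix time $t$, block $\mathcal{P}_{td}$ and recall
that $I^*_{td} = \min \mathcal{P}_{td}$ is the most attractive item in $\mathcal{P}_{td}$. Let $k = A_t^{-1}(I^*_{td})$ be the
position of item $I^*_{td}$ and $\sigma$ be the permutation that exchanges items $k$ and $I^*_{td}$.
By Lemma 32.6, on the event $F_t^c$, we have $I^*_{td} \leq k$. From Parts (c) and (d) of
Assumption 32.1, we have $v(A_t, k) \geq v(\sigma \circ A_t, k) \geq v(a^*, k)$. Hence, on the event
$F_t^c$, the expected number of clicks on $I^*_{td}$ is bounded from below by those on
items in $a^*$,

$$\begin{aligned}
\mathbb{E}_{t-1}\left[C_{tI^*_{td}}\right]
&= \sum_{k \in \mathcal{I}_{td}} \mathbb{P}_{t-1}\left(A_t^{-1}(I^*_{td}) = k\right) \mathbb{E}_{t-1}\left[v(A_t, k) \mid A_t^{-1}(I^*_{td}) = k\right] \\
&= \frac{1}{|\mathcal{I}_{td}|} \sum_{k \in \mathcal{I}_{td}} \mathbb{E}_{t-1}\left[v(A_t, k) \mid A_t^{-1}(I^*_{td}) = k\right]
\geq \frac{1}{|\mathcal{I}_{td}|} \sum_{k \in \mathcal{I}_{td}} v(a^*, k),
\end{aligned}$$

where we also used the fact that TopRank randomises within each block to
guarantee that $\mathbb{P}_{t-1}(A_t^{-1}(I^*_{td}) = k) = 1/|\mathcal{I}_{td}|$ for any $k \in \mathcal{I}_{td}$. Using this and the
design of TopRank,

$$\sum_{k=1}^{m} v(a^*, k) = \sum_{d=1}^{M_t} \sum_{k \in \mathcal{I}_{td}} v(a^*, k) \leq \sum_{d=1}^{M_t} |\mathcal{I}_{td}|\, \mathbb{E}_{t-1}\left[C_{tI^*_{td}}\right].$$

Therefore, under event $F_t^c$, the conditional expected regret in round $t$ is bounded
by

$$\begin{aligned}
\sum_{k=1}^{m} v(a^*, k) - \mathbb{E}_{t-1}\left[\sum_{j=1}^{\ell} C_{tj}\right]
&\leq \mathbb{E}_{t-1}\left[\sum_{d=1}^{M_t} |\mathcal{P}_{td}|\, C_{tI^*_{td}} - \sum_{j=1}^{\ell} C_{tj}\right] \\
&= \mathbb{E}_{t-1}\left[\sum_{d=1}^{M_t} \sum_{j \in \mathcal{P}_{td}} \left(C_{tI^*_{td}} - C_{tj}\right)\right] \\
&= \sum_{d=1}^{M_t} \sum_{j \in \mathcal{P}_{td}} \mathbb{E}_{t-1}\left[U_{tI^*_{td}j}\right] \\
&\leq \sum_{j=1}^{\ell} \sum_{i=1}^{\min\{m, j-1\}} \mathbb{E}_{t-1}\left[U_{tij}\right].
\end{aligned} \tag{32.4}$$

The last inequality follows by noting that $\mathbb{E}_{t-1}[U_{tI^*_{td}j}] \leq \sum_{i=1}^{\min\{m, j-1\}} \mathbb{E}_{t-1}[U_{tij}]$.
To see this, use part (a) of Lemma 32.3 to show that $\mathbb{E}_{t-1}[U_{tij}] \geq 0$ for $i < j$ and
Lemma 32.6 to show that when $I^*_{td} > m$, then neither $I^*_{td}$ nor $j$ is shown to
the user in round $t$ so that $U_{tI^*_{td}j} = 0$. Substituting the bound in Eq. (32.4) into
the regret leads to

$$R_n \leq n m\, \mathbb{P}(F_n) + \sum_{j=1}^{\ell} \sum_{i=1}^{\min\{m, j-1\}} \mathbb{E}\left[\mathbb{I}\{F_n^c\}\, S_{nij}\right], \tag{32.5}$$

where we used the fact that the maximum number of clicks over $n$ rounds is
$nm$. The proof of the first part is completed by using Lemma 32.4 to bound
the first term and Lemma 32.7 to bound the second. The problem-independent
bound follows from Eq. (32.5) and by stopping early in the proof of Lemma 32.7
(Exercise 32.6).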


Notes 




At no point in the analysis did we use the fact that $v$ is fixed over time. Suppose
that $v_1, \ldots, v_n$ are a sequence of click-probability functions that all satisfy
Assumption 32.1 with the same attractiveness function. The regret in this
setting is

$$R_n = \sum_{t=1}^{n} \sum_{k=1}^{m} v_t(a^*, k) - \mathbb{E}\left[\sum_{t=1}^{n} \sum_{i=1}^{\ell} C_{ti}\right].$$

Then the bounds in Theorem 32.2 still hold without changing the algorithm.
The cascade model is usually formalised in the following more restrictive fashion.
Let $\{Z_{ti} : i \in [\ell], t \in [n]\}$ be a collection of independent Bernoulli random
variables with $\mathbb{P}(Z_{ti} = 1) = \alpha(i)$. Then define $M_t$ as the position of the first item $i$ in the
shortlist with $Z_{ti} = 1$:

$$M_t = \min\left\{k \in [m] : Z_{tA_t(k)} = 1\right\},$$

where the minimum of an empty set is $\infty$. Finally let $C_{ti} = 1$ if and only if
$M_t \leq m$ and $A_t(M_t) = i$. This set-up satisfies Eq. (32.1), but the independence
assumption makes it possible to estimate $\alpha$ without randomisation. Notice that
in any round $t$ with $M_t \leq m$, all items $i$ with $A_t^{-1}(i) < M_t$ must have been
unattractive ($Z_{ti} = 0$), while the clicked item must be attractive ($Z_{ti} = 1$).
This fact can be used in combination with standard concentration analysis to
estimate the attractiveness. The optimistic policy sorts the $\ell$ items in decreasing
order by their upper confidence bounds and shortlists the first $m$. When the
confidence bounds are derived from Hoeffding's inequality, this policy is called
CascadeUCB, while the policy that uses Chernoff's lemma is called CascadeKL-UCB.
The computational cost of the latter policy is marginally higher than
the former, but the improvement is also quite significant because in practice
most items have barely positive attractiveness.
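As an illustration of the cascade feedback just described, the following sketch simulates the clicks for a single round. The function name `cascade_clicks` and its arguments are hypothetical. Instead of drawing all the variables $Z_{ti}$ up front, it draws them lazily while scanning the shortlist, which gives the same click distribution because only the first attractive shortlisted item matters.

```python
import random


def cascade_clicks(ranking, attractiveness, m, rng=random):
    """Return {item: 0 or 1} for one round of the cascade model.

    The user scans the first m items of `ranking` in order and clicks the
    first item found attractive; at most one item is clicked.
    """
    clicks = {item: 0 for item in ranking}
    for item in ranking[:m]:
        if rng.random() < attractiveness[item]:  # Z_{ti} = 1 with prob. alpha(i)
            clicks[item] = 1
            break  # the scan stops at the first click
    return clicks
```

Items ranked after the clicked one reveal nothing in that round, which is why estimators for this model only update the items at or before position $M_t$.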
The linear dependence of the regret on $\ell$ is unpleasant when the number of
items is large, which is the case in many practical problems. Like for finite-armed
bandits, one can introduce a linear structure on the items by assuming
that $\alpha(i) = \langle \theta, \phi_i \rangle$ where $\theta \in \mathbb{R}^d$ is an unknown parameter vector and $(\phi_i)_{i=1}^{\ell}$
are known feature vectors. This has been investigated in the cascade model by
Zong et al. [2016] and with a model resembling that of this chapter by Li et al.
[2019a].
There is an adversarial variant of the cascade model. In the ranked bandit
model an adversary secretly chooses a sequence of sets $S_1, \ldots, S_n$ with $S_t \subseteq [\ell]$.
In each round $t$ the learner chooses $A_t \in \mathcal{A}$ and receives a reward $X_t(A_t)$, where
$X_t : \mathcal{A} \to [0,1]$ is given by $X_t(a) = \mathbb{I}\{S_t \cap \{a(1), \ldots, a(m)\} \neq \emptyset\}$. The feedback
is the position of the clicked action, which is $M_t = \min\{k \in [m] : A_t(k) \in S_t\}$.
The regret is

$$R_n = \sum_{t=1}^{n} \left(X_t(a^*) - X_t(A_t)\right),$$

where $a^*$ is the optimal ranking in hindsight:

$$a^* = \operatorname{argmax}_{a \in \mathcal{A}} \sum_{t=1}^{n} X_t(a). \tag{32.6}$$

Notice that this is the same as the cascade model when $S_t = \{i : Z_{ti} = 1\}$.

A challenge in the ranked bandit model is that solving the offline problem (Eq.
32.6) for known $S_1, \ldots, S_n$ is NP-hard. How can one learn when finding an
optimal solution to the offline problem is hard? First, hardness only matters if
$|\mathcal{A}|$ is large. When $\ell$ and $m$ are not too large, then exhaustive search is quite
feasible. If this is not an option, one may use an approximation algorithm.
It turns out that in a certain sense, the best one can do is to use a greedy
algorithm. We omit the details, but the highlight is that there exist efficient
algorithms such that

$$\mathbb{E}\left[\sum_{t=1}^{n} X_t(A_t)\right] \geq \left(1 - \frac{1}{e}\right) \max_{a \in \mathcal{A}} \sum_{t=1}^{n} X_t(a) - O\left(m\sqrt{n \ell \log(\ell)}\right).$$

See the article by Radlinski et al. [2008] for more details.

By modifying the reward function, one can also define an adversarial variant
of the document-based model. As in the previous note, the adversary secretly
chooses $S_1, \ldots, S_n$ as subsets of $[\ell]$, but now the reward is

$$X_t(a) = \left|S_t \cap \{a(1), \ldots, a(m)\}\right|.$$

The feedback is the positions of the clicked items, $S_t \cap \{a(1), \ldots, a(m)\}$. For
this model, there are no computation issues. In fact, the problem can be
analysed using a reduction to combinatorial semi-bandits, which we ask you to
investigate in Exercise 32.3.

The position-based model can also be modelled in the adversarial setting by
letting $S_{tk} \subseteq [\ell]$ for each $t \in [n]$ and $k \in [m]$ and defining the reward by

$$X_t(a) = \sum_{k=1}^{m} \mathbb{I}\{a(k) \in S_{tk}\}.$$

Again, the feedback is the positions of the clicked items, $\{k \in [m] : A_t(k) \in S_{tk}\}$.
This model can also be tackled using algorithms for combinatorial semi-bandits
(Exercise 32.4).
a different algorithm for the same model that is also asymptotically optimal. The
optimal regret has a complicated form and is not given explicitly in all generality.
We remarked in the notes that the linear dependence on $\ell$ is problematic for
large $\ell$. To overcome this problem, Zong et al. [2016] introduce a linear variant
where the attractiveness of an item is assumed to be an inner product between an
unknown parameter and a known feature vector. A slightly generalised version of
this set-up was simultaneously studied by Li et al. [2016], who allowed the features
associated with each item to change from round to round. The position-based 
model is studied by Lagree et al. [2016], who suggest several algorithms and 
provide logarithmic regret analysis for some of them. Asymptotic lower bounds 
are also given that match the upper bounds in some regimes. Katariya et al. [2016] 
study the dependent click model introduced by Guo et al. [2009]. This differs 
from the models proposed in this chapter because the reward is not assumed 
to be the number of clicks and is actually unobserved. We leave the reader to 
explore this interesting model on their own. The adversarial variant of the ranking 
problem mentioned in the notes is due to Radlinski et al. [2008]. Another related 
problem is the rank-1 bandit problem, where the learner chooses one of $\ell$ items
to place in one of $m$ positions, with all other positions left empty. This model has
been investigated by Katariya et al. [2017a,b], who assume the position-based 
model. The cascade feedback model is also used in a combinatorial setting by 
Kveton et al. [2015c], but this paper does not have a direct application to ranking. 
A more in-depth discussion on ranking can be found in the recent book on bandits 
in information retrieval by Glowacka [2019], which discusses a number of practical 
considerations, like the cold-start problem. 


Exercises 


32.1 (CLICK MODELS AND ASSUMPTIONS) Show that the document-based, 
position-based and cascade models all satisfy Assumption 32.1. 


32.2 (DIVERSITY) Most ranking algorithms are based on assigning an 
attractiveness value to each item and shortlisting the m most attractive items. 
Radlinski et al. [2008] criticise this approach in their paper as follows: 


“The theoretical model that justifies ranking documents in this way is the probabilistic 
ranking principle [Robertson, 1977]. It suggests that documents should be ranked by their 
probability of relevance to the query. However, the optimality of such a ranking relies 
on the assumption that there are no statistical dependencies between the probabilities 
of relevance among documents — an assumption that is clearly violated in practice. For 
example, if one document about jaguar cars is not relevant to a user who issues the 
query jaguar, other car pages become less likely to be relevant. Furthermore, empirical 
studies have shown that given a fixed query, the same document can have different 
relevance to different users [Teevan et al., 2007]. This undermines the assumption that 
each document has a single relevance score that can be provided as training data to the 
learning algorithm. Finally, as users are usually satisfied with finding a small number of, 
or even just one, relevant document, the usefulness and relevance of a document does 
depend on other documents ranked higher.” 


The optimality criterion Radlinski et al. [2008] had in mind is to present at least 
one item that the user is attracted to. Do you find this argument convincing? 
Why or why not? 


The probabilistic ranking principle was put forward by Maron and Kuhns
[1960]. The paper by Robertson [1977] identifies some sufficient conditions
under which the principle is valid and also discusses its limitations.


32.3 (ADVERSARIAL RANKING AS A SEMI-BANDIT (I)) Frame the adversarial
variant of the document-based model in Note 6 as a combinatorial semi-bandit
and use the results in Chapter 30 to prove a bound on the regret of

$$R_n \leq \sqrt{2 m \ell n (1 + \log(\ell))}.$$


32.4 (ADVERSARIAL RANKING AS A SEMI-BANDIT (II)) Adapt your solution to
the previous exercise to the position-based model in Note 7, and prove a bound
on the regret of

$$R_n \leq m\sqrt{2 \ell n (1 + \log(\ell))}.$$


32.5 (CYCLES IN PARTIAL ORDER) Prove that if $G_t$ does not contain cycles, then
$M_t$ defined in Section 32.2 is well defined and that $\mathcal{P}_{t1}, \ldots, \mathcal{P}_{tM_t}$ is a partition of
$[\ell]$.


32.6 (WORST-CASE BOUND FOR TOPRANK) Prove the second part of
Theorem 32.2.
Pure Exploration


All the policies proposed in this book so far were designed to maximise the
cumulative reward. As a consequence, the policies must carefully balance
exploration against exploitation. But what happens if there is no price to be
paid for exploring? Imagine, for example, that a researcher has $k$ configurations
of a new drug and a budget to experiment on $n$ mice. The researcher wants to
find the most promising drug configuration for subsequent human trials, but is
not concerned with the outcomes for the mice. Problems of this nature are called
pure exploration problems. Although there are similarities to the cumulative
regret setting, there are also differences. This chapter outlines a variety of pure
exploration problems and describes the basic algorithmic ideas.

Figure 33.1 The mouse never benefits from the experiment.


Simple Regret 


Let $\nu$ be a $k$-armed stochastic bandit and $\pi = (\pi_t)_{t=1}^{n+1}$ be a policy. One way to
measure the performance of a policy in the pure exploration setting is the simple
regret,

$$R_n^{\text{simple}}(\pi, \nu) = \mathbb{E}_{\nu\pi}\left[\Delta_{A_{n+1}}(\nu)\right].$$


The action chosen in round n+ 1 has a special role. In the example with the mice, 
it represents the configuration recommended for further investigation at the end 
of the trial. We start by analysing the uniform exploration (UE) policy, which 
explores deterministically for the first n rounds and recommends the empirically 
best arm in round n + 1. The pseudocode is provided in Algorithm 20. 


THEOREM 33.1. Let $\pi$ be the policy of Algorithm 20 and $\nu \in \mathcal{E}_{\text{SG}}^k(1)$ be a
1-subgaussian bandit. Then, for all $n \geq k$,

$$R_n^{\text{simple}}(\pi, \nu) \leq \min_{\Delta \geq 0} \left(\Delta + \sum_{i : \Delta_i(\nu) > \Delta} \Delta_i(\nu) \exp\left(-\frac{\lfloor n/k \rfloor \Delta_i(\nu)^2}{4}\right)\right).$$

    for t = 1, ..., n do
        Choose $A_t = 1 + (t \bmod k)$
    end for
    Choose $A_{n+1} = \operatorname{argmax}_{i \in [k]} \hat\mu_i(n)$

Algorithm 20: Uniform exploration.
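Algorithm 20 is short enough to write out in full. The sketch below uses zero-based arm indices, so the choice `t % k` corresponds to $A_t = 1 + (t \bmod k)$ in the pseudocode; the function name and the `sample_reward` callback are assumptions made for the example, and $n \geq k$ is required so that every arm is sampled at least once.

```python
def uniform_exploration(sample_reward, k, n):
    """Play the arms in round-robin order for n rounds (n >= k) and
    recommend the arm with the largest empirical mean."""
    sums = [0.0] * k
    counts = [0] * k
    for t in range(1, n + 1):
        arm = t % k  # zero-based version of A_t = 1 + (t mod k)
        sums[arm] += sample_reward(arm)
        counts[arm] += 1
    means = [sums[i] / counts[i] for i in range(k)]
    return max(range(k), key=lambda i: means[i])
```

Here `sample_reward(arm)` is expected to return one noisy reward from the chosen arm.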


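Proof Let $\Delta_i = \Delta_i(\nu)$ and $\mathbb{P} = \mathbb{P}_{\nu\pi}$. Assume without loss of generality that
$\Delta_1 = 0$, and let $i$ be a suboptimal arm with $\Delta_i > \Delta$. Observe that $A_{n+1} = i$
implies that $\hat\mu_i(n) \geq \hat\mu_1(n)$. Now $T_i(n) \geq \lfloor n/k \rfloor$ is not random, so by Theorem 5.3
and Lemma 5.4,

$$\mathbb{P}\left(\hat\mu_i(n) \geq \hat\mu_1(n)\right) = \mathbb{P}\left(\hat\mu_i(n) - \hat\mu_1(n) \geq 0\right) \leq \exp\left(-\frac{\lfloor n/k \rfloor \Delta_i^2}{4}\right). \tag{33.1}$$

The definition of the simple regret yields

$$R_n^{\text{simple}}(\pi, \nu) = \sum_{i=1}^{k} \Delta_i \mathbb{P}(A_{n+1} = i) \leq \Delta + \sum_{i : \Delta_i > \Delta} \Delta_i \mathbb{P}(A_{n+1} = i).$$

The proof is completed by substituting Eq. (33.1) and taking the minimum over
all $\Delta \geq 0$.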


The theorem highlights some important differences between the simple regret
and the cumulative regret. If $\nu$ is fixed and $n$ tends to infinity, then the simple
regret converges to zero exponentially fast. On the other hand, if $n$ is fixed and $\nu$
is allowed to vary, then we are in a worst-case regime. Theorem 33.1 can be used
to derive a bound in this case by choosing $\Delta = 2\sqrt{\log(k)/\lfloor n/k \rfloor}$, which after a
short algebraic calculation shows that for $n \geq k$ there exists a universal constant
$C > 0$ such that

$$R_n^{\text{simple}}(\text{UE}, \nu) \leq C\sqrt{\frac{k \log(k)}{n}} \quad \text{for all } \nu \in \mathcal{E}_{\text{SG}}^k(1). \tag{33.2}$$

In Exercise 33.1 we ask you to use the techniques of Chapter 15 to prove that for
all policies there exists a bandit $\nu \in \mathcal{E}_{\text{SG}}^k(1)$ such that $R_n^{\text{simple}}(\pi, \nu) \geq C\sqrt{k/n}$ for
some universal constant $C > 0$. It turns out the logarithmic dependence on $k$ in
Eq. (33.2) is tight for uniform exploration (Exercise 33.2), but there exists another
policy for which the simple regret matches the aforementioned lower bound up to
constant factors. There are several ways to do this, but the most straightforward
is via a reduction from algorithms designed for minimising cumulative regret.


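PROPOSITION 33.2. Let $\pi = (\pi_t)_{t=1}^{n}$ be a policy, and define

$$\pi_{n+1}(i \mid a_1, x_1, \ldots, a_n, x_n) = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}\{a_t = i\}.$$

Then the simple regret of $(\pi_t)_{t=1}^{n+1}$ satisfies

$$R_n^{\text{simple}}\left((\pi_t)_{t=1}^{n+1}, \nu\right) = \frac{R_n(\pi, \nu)}{n},$$

where $R_n(\pi, \nu)$ is the cumulative regret of policy $\pi = (\pi_t)_{t=1}^{n}$ on bandit $\nu$.


Proof By the regret decomposition identity (4.5),

$$R_n(\pi, \nu) = n\, \mathbb{E}\left[\sum_{i=1}^{k} \Delta_i(\nu) \frac{T_i(n)}{n}\right] = n\, \mathbb{E}\left[\Delta_{A_{n+1}}(\nu)\right] = n R_n^{\text{simple}}\left((\pi_t)_{t=1}^{n+1}, \nu\right),$$

where the first equality follows from the definition of the cumulative regret and
the regret decomposition, the second from the definition of $\pi_{n+1}$ and the last from
the definition of the simple regret.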
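The identity in Proposition 33.2 can be checked on a single history of plays. In the sketch below, `recommendation_distribution` builds $\pi_{n+1}$ from the played actions and `conditional_simple_regret` averages the gaps under it; both names, and the use of known gaps, are assumptions made purely for illustration.

```python
from collections import Counter


def recommendation_distribution(actions):
    """pi_{n+1}: recommend arm i with probability T_i(n) / n."""
    n = len(actions)
    return {arm: count / n for arm, count in Counter(actions).items()}


def conditional_simple_regret(actions, gaps):
    """Expected gap of the recommended arm, given the history of plays."""
    dist = recommendation_distribution(actions)
    return sum(prob * gaps[arm] for arm, prob in dist.items())


# One fixed history: the averaged gap equals (sum of played gaps) / n.
history = [0, 0, 1, 2]
gaps = [0.0, 0.5, 1.0]
pseudo_regret = sum(gaps[a] for a in history)
assert abs(conditional_simple_regret(history, gaps) - pseudo_regret / len(history)) < 1e-12
```

Taking expectations over the history on both sides gives the statement of the proposition.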


An immediate corollary of the previous proposition is that the minimax simple
regret over $k$-armed bandits is of the order $O(\sqrt{k/n})$.


COROLLARY 33.3. There exists a constant $C > 0$ such that for all $n \geq k > 1$,

$$\inf_{\pi} \sup_{\substack{\nu \in \mathcal{E}_{\text{SG}}^k(1) : \\ \Delta(\nu) \in [0,1]^k}} R_n^{\text{simple}}(\pi, \nu) \leq C\sqrt{k/n}.$$


Proof Combine the previous result with Theorem 9.1.


Proposition 33.2 raises our hopes that policies designed for minimising the 
cumulative regret might also have well-behaved simple regret. Indeed, this is true 
in a worst-case sense, as attested by Exercise 33.1 and Proposition 33.2. However, 
policies designed to minimise cumulative regret are wasteful when used on “easy” 
instances. This is because these policies spend most of their time playing the 
optimal arm and play suboptimal arms just barely enough to ensure they are not 
optimal. In pure exploration this leads to a highly suboptimal policy for which
the simple regret is asymptotically polynomial, while we know from Theorem 33.1
that the simple regret should decrease exponentially fast. More details on the
suboptimality of cumulative regret minimisation algorithms, as well as pointers 
to the literature are given in Note 2 at the end of the chapter. 


Best-Arm Identification with a Fixed Confidence 


Best-arm identification is a variant of pure exploration where the learner is
rewarded only for identifying an exactly optimal arm. There are two variants of
best-arm identification. In this section we consider the fixed confidence setting,
in which the learner is given a confidence level $\delta \in (0,1)$ and should use as few
samples as possible to output an arm that is optimal with probability at least
$1 - \delta$. In the other variant the learner has to make a decision after $n$ rounds and
the goal is to minimise the probability of selecting a suboptimal arm. We treat
this alternative in the next section.
In the fixed confidence setting, the learner chooses a policy $\pi = (\pi_t)_{t=1}^{\infty}$
as normal. The number of rounds is not fixed in advance, however. Instead, the
learner chooses a stopping time $\tau$ adapted to filtration $\mathbb{F} = (\mathcal{F}_t)_{t=0}^{\infty}$ with
$\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$. The learner also chooses an $\mathcal{F}_\tau$-measurable random
variable $\psi$ taking values in $[k]$. The stopping time represents the time when the
learner halts and $\psi \in [k]$ is the recommended action, which by the measurability
assumption only depends on $(A_1, X_1, \ldots, A_\tau, X_\tau)$. Note that in line with our
definition of stopping times (see Definition 3.6), it is possible that $\tau = \infty$, which
just means the learner cannot ever make up their mind to stop. This behaviour
of a learner, of course, will not be encouraged! The function $\psi$ is called the
selection rule.


DEFINITION 33.4. A triple $(\pi, \tau, \psi)$ is sound at confidence level $\delta \in (0,1)$ for
environment class $\mathcal{E}$ if for all $\nu \in \mathcal{E}$,

$$\mathbb{P}_{\nu\pi}\left(\tau < \infty \text{ and } \Delta_\psi(\nu) > 0\right) \leq \delta. \tag{33.3}$$


The objective in fixed confidence best-arm identification is to find a sound
learner for which $\mathbb{E}_{\nu\pi}[\tau]$ is minimised over environments $\nu \in \mathcal{E}$. Since this is a
multi-objective criterion, there is a priori no reason to believe that a single optimal
learner should exist. Conveniently, however, the condition that the learner must
satisfy Eq. (33.3) plays the role of the consistency assumption in the asymptotic
lower bounds in Chapter 16, which allows for a sense of instance-dependent
asymptotic optimality. The situation in finite time is more complicated, as we
discuss in Note 7.


If $\mathcal{E}$ is sufficiently rich and $\nu$ has multiple optimal arms, then no sound learner
can stop in finite time with positive probability. The reason is that there is
no way to reject the hypothesis that one optimal arm is fractionally better
than another. You will investigate this in Exercise 33.10. Also note that
in our definition, $\mathbb{I}\{\tau = t\}$ is a deterministic function of $A_1, X_1, \ldots, A_t, X_t$.
None of the results that follow would change if you allowed $\tau$ or $\psi$ to also
depend on some exogenous source of randomness.


Lower Bound 


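We start with the lower bound, which serves as a target for the upper bound to
follow. Let $\mathcal{E}$ be an arbitrary set of $k$-armed stochastic bandit environments, and
for $\nu \in \mathcal{E}$ define

$$i^*(\nu) = \operatorname{argmax}_{i \in [k]} \mu_i(\nu) \quad \text{and} \quad \mathcal{E}_{\text{alt}}(\nu) = \{\nu' \in \mathcal{E} : i^*(\nu') \cap i^*(\nu) = \emptyset\},$$

which is the set of bandits in $\mathcal{E}$ with different optimal arms than $\nu$.


THEOREM 33.5. Assume that $(\pi, \tau, \psi)$ is sound for $\mathcal{E}$ at confidence level $\delta \in (0,1)$,
and let $\nu \in \mathcal{E}$. Then $\mathbb{E}_{\nu\pi}[\tau] \geq c^*(\nu) \log\left(\frac{1}{4\delta}\right)$, where

$$c^*(\nu)^{-1} = \sup_{\alpha \in \mathcal{P}_{k-1}} \inf_{\nu' \in \mathcal{E}_{\text{alt}}(\nu)} \left(\sum_{i=1}^{k} \alpha_i D(\nu_i, \nu'_i)\right) \tag{33.4}$$

with $c^*(\nu) = \infty$ when $c^*(\nu)^{-1} = 0$.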


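Proof The result is trivial when $\mathbb{E}_{\nu\pi}[\tau] = \infty$. For the remainder, assume that
$\mathbb{E}_{\nu\pi}[\tau] < \infty$, which implies that $\mathbb{P}_{\nu\pi}(\tau = \infty) = 0$. Next, let $\nu' \in \mathcal{E}_{\text{alt}}(\nu)$ and
define event $E = \{\tau < \infty \text{ and } \psi \notin i^*(\nu')\} \in \mathcal{F}_\tau$. Then,

$$\begin{aligned}
2\delta &\geq \mathbb{P}_{\nu\pi}\left(\tau < \infty \text{ and } \psi \notin i^*(\nu)\right) + \mathbb{P}_{\nu'\pi}\left(\tau < \infty \text{ and } \psi \notin i^*(\nu')\right) \\
&\geq \mathbb{P}_{\nu\pi}(E^c) + \mathbb{P}_{\nu'\pi}(E) \\
&\geq \frac{1}{2} \exp\left(-\sum_{i=1}^{k} \mathbb{E}_{\nu\pi}[T_i(\tau)]\, D(\nu_i, \nu'_i)\right),
\end{aligned} \tag{33.5}$$

where the first inequality follows from the definition of soundness and the last
from the Bretagnolle-Huber inequality (Theorem 14.2) and the stopping time
version of Lemma 15.1 (see Exercise 15.7). The second inequality holds because
$\mathbb{P}_{\nu\pi}(\tau = \infty) = 0$ and $i^*(\nu) \cap i^*(\nu') = \emptyset$ and

$$\begin{aligned}
E^c &= \{\tau = \infty\} \cup \{\tau < \infty \text{ and } \psi \in i^*(\nu')\} \\
&\subseteq \{\tau = \infty\} \cup \{\tau < \infty \text{ and } \psi \notin i^*(\nu)\}.
\end{aligned}$$

Rearranging Eq. (33.5) shows that

$$\sum_{i=1}^{k} \mathbb{E}_{\nu\pi}[T_i(\tau)]\, D(\nu_i, \nu'_i) \geq \log\left(\frac{1}{4\delta}\right), \tag{33.6}$$

which implies that $\mathbb{E}_{\nu\pi}[\tau] > 0$. Using this, the definition of $c^*(\nu)$ and Eq. (33.6),

$$\begin{aligned}
\frac{\mathbb{E}_{\nu\pi}[\tau]}{c^*(\nu)}
&= \mathbb{E}_{\nu\pi}[\tau] \sup_{\alpha \in \mathcal{P}_{k-1}} \inf_{\nu' \in \mathcal{E}_{\text{alt}}(\nu)} \sum_{i=1}^{k} \alpha_i D(\nu_i, \nu'_i) \\
&\geq \mathbb{E}_{\nu\pi}[\tau] \inf_{\nu' \in \mathcal{E}_{\text{alt}}(\nu)} \sum_{i=1}^{k} \frac{\mathbb{E}_{\nu\pi}[T_i(\tau)]}{\mathbb{E}_{\nu\pi}[\tau]} D(\nu_i, \nu'_i) \\
&= \inf_{\nu' \in \mathcal{E}_{\text{alt}}(\nu)} \sum_{i=1}^{k} \mathbb{E}_{\nu\pi}[T_i(\tau)]\, D(\nu_i, \nu'_i) \\
&\geq \log\left(\frac{1}{4\delta}\right),
\end{aligned} \tag{33.7}$$

where the last inequality follows from Eq. (33.6). Rearranging completes the proof.
Note, in the special case that $c^*(\nu)^{-1} = 0$, the assumption that $\mathbb{E}_{\nu\pi}[\tau] < \infty$
would lead to a contradiction.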
Theorem 33.5 does not depend on $\mathcal{E}$ being unstructured. The assumption
that the bandits are finite armed could also be relaxed with appropriate
measurability assumptions.


In a moment, we will prove that the bound in Theorem 33.5 is asymptotically
optimal as $\delta \to 0$ when $\mathcal{E} = \mathcal{E}_{\mathcal{N}}^k(1)$, a result that holds more generally for
Bernoulli bandits or when the distributions come from an exponential family.
Before this, we devote a little time to understanding the constant $c^*(\nu)$. Suppose
that $\alpha^*(\nu) \in \mathcal{P}_{k-1}$ satisfies

$$c^*(\nu)^{-1} = \inf_{\nu' \in \mathcal{E}_{\text{alt}}(\nu)} \sum_{i=1}^{k} \alpha^*_i(\nu)\, D(\nu_i, \nu'_i).$$

A few observations about this optimisation problem: 


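(a) The value of $\alpha^*(\nu)$ is unique when $\mathcal{E} = \mathcal{E}_{\mathcal{N}}^k(1)$ and $\nu \in \mathcal{E}$ has a unique
optimal arm. Uniqueness continues to hold when $\mathcal{E}$ is unstructured with
distributions from an exponential family.

(b) The inequality in Eq. (33.7) is tightest when $\mathbb{E}_{\nu\pi}[T_i(\tau)]/\mathbb{E}_{\nu\pi}[\tau] = \alpha^*_i(\nu)$,
which shows a policy can only match the lower bound by playing arm $i$
exactly in proportion to $\alpha^*_i(\nu)$ in the limit as $\delta$ tends to zero.

(c) When $\mathcal{E} = \mathcal{E}_{\mathcal{N}}^2(1)$ and $\nu \in \mathcal{E}$ has a unique optimal arm, then

$$\begin{aligned}
c^*(\nu)^{-1} &= \frac{1}{2} \sup_{\alpha \in [0,1]} \inf_{\nu' \in \mathcal{E}_{\text{alt}}(\nu)} \left\{\alpha\left(\mu_1(\nu) - \mu_1(\nu')\right)^2 + (1 - \alpha)\left(\mu_2(\nu) - \mu_2(\nu')\right)^2\right\} \\
&= \frac{1}{2} \sup_{\alpha \in [0,1]} \alpha(1 - \alpha)\left(\mu_1(\nu) - \mu_2(\nu)\right)^2 = \frac{1}{8}\left(\mu_1(\nu) - \mu_2(\nu)\right)^2.
\end{aligned}$$

In this case we observe that $\alpha^*_1(\nu) = \alpha^*_2(\nu) = 1/2$.

(d) Suppose that $\sigma^2 \in (0, \infty)^k$ is fixed and $\mathcal{E} = \{(\mathcal{N}(\mu_i, \sigma_i^2))_{i=1}^{k} : \mu \in \mathbb{R}^k\}$. You
are asked in Exercise 33.4 to verify that when $k = 2$,

$$c^*(\nu) = \frac{2(\sigma_1 + \sigma_2)^2}{\Delta_2^2}, \tag{33.8}$$
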
which unsurprisingly shows the problem becomes harder as the variance of 


either of the arms increases. In Exercise 33.4, you will show when k > 2, it 
holds that 


where Amin = minjz;- A; is the smallest suboptimality gap. This bound 
faithfully captures the intuition that each suboptimal arm must be played 
sufficiently often to be distinguished from the optimal arm, while the optimal 
arm must be observed sufficiently many times so that it can be distinguished 
from the second best arm. For $k = 2$, this bound is smaller than the value
of $c^*(\nu)$, as shown in (33.8), showing that there is room for improvement in
this case.
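The closed form in observation (c) can be checked numerically. The sketch below approximates the sup-inf by a grid search for two unit-variance Gaussian arms with $\mu_1 > \mu_2$. It relies on the assumption that the infimum over alternatives is approached by bandits whose two means coincide at a point $x$ between $\mu_2$ and $\mu_1$; the function name and the grid resolution are arbitrary choices for the illustration.

```python
def inverse_complexity_two_arms(mu1, mu2, grid=401):
    """Grid approximation of c*(nu)^{-1} for two unit-variance Gaussian arms
    with mu1 > mu2, using D(N(a,1), N(b,1)) = (a - b)^2 / 2."""
    best = 0.0
    for a in range(grid):
        alpha = a / (grid - 1)
        # Inner infimum: alternative with both means equal to x in [mu2, mu1].
        inner = min(
            0.5 * (alpha * (mu1 - x) ** 2 + (1 - alpha) * (mu2 - x) ** 2)
            for x in (mu2 + (mu1 - mu2) * b / (grid - 1) for b in range(grid))
        )
        best = max(best, inner)  # outer supremum over alpha
    return best
```

With `mu1 = 1` and `mu2 = 0` the search returns a value close to $1/8$, attained at $\alpha = 1/2$, in agreement with the formula $(\mu_1(\nu) - \mu_2(\nu))^2/8$.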


Policy, Stopping/Selection Rule and Upper Bounds 


The bound in Theorem 33.5 is asymptotically tight for many environment classes. 
For simplicity, we focus on the Gaussian case. 


For this section, we assume that $\mathcal{E} = \mathcal{E}_{\mathcal{N}}^k(1)$ is the set of $k$-armed Gaussian
bandits with unit variance.


We need to construct a triple $(\pi, \tau, \psi)$ that is sound for $\mathcal{E}$ and for which $\mathbb{E}_{\nu\pi}[\tau]$
matches the lower bound in Theorem 33.5 as $\delta \to 0$. Both are derived using
the insights provided by the lower bound. The policy should choose action $i$ in
proportion to $\alpha^*_i(\nu)$, which must be estimated from data. The stopping rule is
motivated by noting that Eq. (33.6) implies that a sound stopping rule must
satisfy

$$\sum_{i=1}^{k} \mathbb{E}_{\nu\pi}[T_i(\tau)]\, D(\nu_i, \nu'_i) \geq \log\left(\frac{1}{4\delta}\right) \quad \text{for all } \nu' \in \mathcal{E}_{\text{alt}}(\nu).$$

If the inequality is tight, then we might guess that a reasonable stopping rule is
the first round $t$ when

$$\inf_{\nu' \in \mathcal{E}_{\text{alt}}(\nu)} \sum_{i=1}^{k} T_i(t)\, D(\nu_i, \nu'_i) \geq \log\left(\frac{1}{4\delta}\right).$$

There are two problems: (a) $\nu$ is unknown, so the expression cannot be evaluated,
and (b) we have replaced the expected number of pulls with the actual number of
pulls. Still, let us persevere. To deal with the first problem, we can try replacing
$\nu$ by the Gaussian bandit environment with mean vector $\hat\mu(t)$, which we denote
by $\hat\nu(t)$. Then let

$$Z_t = \inf_{\nu' \in \mathcal{E}_{\text{alt}}(\hat\nu(t))} \sum_{i=1}^{k} T_i(t)\, D(\hat\nu_i(t), \nu'_i) = \frac{1}{2} \inf_{\nu' \in \mathcal{E}_{\text{alt}}(\hat\nu(t))} \sum_{i=1}^{k} T_i(t)\left(\hat\mu_i(t) - \mu_i(\nu')\right)^2.$$

We will show there exists a choice of $\beta_t(\delta) \approx \log(t/\delta)$ such that if $\tau = \min\{t : Z_t \geq \beta_t(\delta)\}$,
then the empirically optimal arm at time $\tau$ is the best arm with probability
at least $1 - \delta$. The next step is to craft a policy for which the expectation of $\tau$
matches the lower bound asymptotically. As we remarked earlier, if the policy
is to match the lower bound, it should play arm $i$ approximately in proportion
to $\alpha^*_i(\nu)$. This suggests estimating $\alpha^*(\nu)$ by $\hat\alpha(t) = \alpha^*(\hat\nu(t))$ and then playing
the arm for which $t\hat\alpha_i(t) - T_i(t)$ is maximised. If $\hat\alpha(t)$ is inaccurate, then perhaps
the samples collected will not allow the algorithm to improve its estimates. To
overcome this last challenge, the policy includes enough forced exploration to
ensure that eventually $\hat\alpha(t)$ converges to $\alpha^*(\nu)$ with high probability. Combining
all these ideas leads to the track-and-stop policy (Algorithm 21).


Input δ and β_t(δ)
Choose each arm once and set t = k
while Z_t < β_t(δ) do
    if min_{i∈[k]} T_i(t) ≤ √t then
        Choose A_{t+1} = argmin_{i∈[k]} T_i(t)
    else
        Choose A_{t+1} = argmax_{i∈[k]} (t α̂_i(t) − T_i(t))
    end if
    Observe reward X_{t+1}, update statistics and increment t
end while
return ψ = i*(ν̂(t)), τ = t


Algorithm 21: Track-and-stop. 
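
The sampling rule inside the loop of Algorithm 21 is short enough to sketch directly. The following is an illustration only: the function name `track_and_stop_choice` is hypothetical, and the estimate α̂(t) is assumed to be supplied by the caller, since computing α*(ν̂(t)) requires solving the optimisation problem that defines c*.

```python
import math
from typing import Sequence


def track_and_stop_choice(t: int, counts: Sequence[int],
                          alpha_hat: Sequence[float]) -> int:
    """Return the index of the arm to play in round t + 1.

    counts[i] is T_i(t) and alpha_hat[i] is the current estimate of the
    optimal sampling proportion of arm i (assumed given).
    """
    arms = range(len(counts))
    if min(counts) <= math.sqrt(t):
        # Forced exploration: play the least-played arm.
        return min(arms, key=lambda i: counts[i])
    # Tracking: play the arm furthest below its target count t * alpha_hat[i].
    return max(arms, key=lambda i: t * alpha_hat[i] - counts[i])
```

With t = 16 and counts (12, 3, 1), the least-played arm has 1 ≤ √16 plays, so the rule forces exploration of that arm regardless of α̂.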


THEOREM 33.6. Let (π, τ, ψ) be the policy, stopping time and selection rule of
track-and-stop (Algorithm 21). There exists a choice of β_t(δ) such that
track-and-stop is sound and for all ν ∈ ℰ with |i*(ν)| = 1 it holds that

\[
\lim_{\delta \to 0} \frac{\mathbb{E}_{\nu\pi}[\tau]}{\log(1/\delta)} = c^*(\nu) .
\]

Note that the policy π does not depend on δ inside the limit statement of the
theorem, but the stopping time does. The following lemma guarantees the soundness of
(π, τ, ψ).

LEMMA 33.7. Let f : [k, ∞) → ℝ be given by f(x) = exp(k − x)(x/k)^k and
β_t(δ) = k log(t² + t) + f⁻¹(δ). Then, for τ = min{t : Z_t ≥ β_t(δ)}, it holds that
P(i*(ν̂(τ)) ≠ i*(ν)) ≤ δ.


The inverse f⁻¹(δ) is well defined because f is strictly decreasing on [k, ∞)
with f(k) = 1 and lim_{x→∞} f(x) = 0. In fact, the inverse has a closed-form
solution in terms of the Lambert W function. By staring at the form
of f one can check that lim_{δ→0} f⁻¹(δ)/log(1/δ) = 1 or equivalently that

\[
f^{-1}(\delta) = (1 + o(1)) \log(1/\delta) .
\]
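
Because f is strictly decreasing on [k, ∞), the inverse can also be computed numerically without the Lambert W function. The following is a minimal sketch using bisection; the function names `f`, `f_inverse` and `beta` and the tolerance parameter are illustrative choices, not part of the text.

```python
import math


def f(x: float, k: int) -> float:
    """f(x) = exp(k - x) * (x / k)^k on [k, infinity)."""
    return math.exp(k - x) * (x / k) ** k


def f_inverse(delta: float, k: int, tol: float = 1e-12) -> float:
    """Solve f(x) = delta for x >= k by bisection, for delta in (0, 1)."""
    lo, hi = float(k), 2.0 * k
    while f(hi, k) > delta:      # grow the bracket until f(hi) <= delta
        hi *= 2.0
    while hi - lo > tol * hi:    # f is decreasing, so keep f(lo) > delta >= f(hi)
        mid = 0.5 * (lo + hi)
        if f(mid, k) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def beta(t: int, delta: float, k: int) -> float:
    """Threshold of Lemma 33.7: k log(t^2 + t) + f^{-1}(delta)."""
    return k * math.log(t * t + t) + f_inverse(delta, k)
```

Since f(x) ≥ exp(−x) for x ≥ k, the computed inverse is never smaller than log(1/δ), which is a convenient sanity check.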


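Proof of Lemma 33.7 Notice that |i*(ν̂(t))| > 1 implies that Z_t = 0. Hence
|i*(ν̂(τ))| = 1 for τ < ∞, and the selection rule is well defined. Abbreviate
μ = μ(ν) and Δ = Δ(ν), and assume without loss of generality that Δ_1 = 0. By
the definition of τ and Z_t,

\[
\{\nu \in \mathcal{E}_{\mathrm{alt}}(\hat\nu(\tau))\}
\subseteq \left\{ \frac{1}{2} \sum_{i=1}^k T_i(\tau) (\hat\mu_i(\tau) - \mu_i)^2 \ge \beta_\tau(\delta) \right\} .
\]

Using the definition of ℰ_alt(ν̂(τ)) yields

\[
\mathbb{P}(1 \notin i^*(\hat\nu(\tau)))
= \mathbb{P}(\nu \in \mathcal{E}_{\mathrm{alt}}(\hat\nu(\tau)))
\le \mathbb{P}\left( \frac{1}{2} \sum_{i=1}^k T_i(\tau) (\hat\mu_i(\tau) - \mu_i)^2 \ge \beta_\tau(\delta) \right) .
\]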

Then apply Lemma 33.8 and Proposition 33.9 from Section 33.2.3. 


A candidate for β_t(δ) can be extracted from the proof and satisfies β_t(δ) ≈
2k log t + log(1/δ). This can be improved to approximately k log log(t) + log(1/δ)
by using a law of the iterated logarithm bound instead of Lemma 33.8. Below, 
we sketch the proof of Theorem 33.6. A more complete outline is given in 
Exercise 33.6. 


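Proof sketch of Theorem 33.6 Lemma 33.7 shows that (π, τ) is sound. It
remains to control the expectation of the stopping time. The intuition is
straightforward. As more samples are collected, we expect that α̂(t) ≈ α*(ν) and
μ̂(t) ≈ μ and

\[
Z_t = \inf_{\nu' \in \mathcal{E}_{\mathrm{alt}}(\hat\nu(t))} \sum_{i=1}^k T_i(t) \frac{(\hat\mu_i(t) - \mu_i(\nu'))^2}{2}
\approx t \inf_{\nu' \in \mathcal{E}_{\mathrm{alt}}(\nu)} \sum_{i=1}^k \alpha_i^*(\nu) \frac{(\mu_i(\nu) - \mu_i(\nu'))^2}{2}
= \frac{t}{c^*(\nu)} .
\]

Provided the approximation is reasonably accurate, the algorithm should halt
once

\[
\frac{t}{c^*(\nu)} \ge \beta_t(\delta) = (1 + o(1)) \log(1/\delta) ,
\]

which occurs once t ≥ (1 + o(1)) c*(ν) log(1/δ).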


Concentration 
The first concentration theorem follows from Corollary 5.5 and a union bound. 


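LEMMA 33.8. Let (X_t)_{t=1}^∞ be a sequence of independent Gaussian random variables
with mean μ and unit variance. Let μ̂_n = (1/n) Σ_{t=1}^n X_t. Then

\[
\mathbb{P}\left( \text{exists } n \in \mathbb{N}^+ : \frac{n}{2} (\hat\mu_n - \mu)^2 \ge \log(1/\delta) + \log(n(n+1)) \right) \le \delta .
\]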


As we remarked earlier, the log(n(n + 1)) term can be improved to
approximately log log(n). You can do this using peeling (Chapter 9) or the
method of mixtures (Exercise 20.9). Since (X_t) are Gaussian, you can also use
the tangent approximation and the Bachelier-Lévy formula (Exercise 9.4).
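
The looseness of the union bound in Lemma 33.8 is easy to see in simulation. The following sketch estimates, by Monte Carlo, how often the deviation event occurs within a finite horizon. The function name, the truncation to a finite horizon, the choice μ = 0 and the seed are assumptions made for illustration; truncation can only decrease the frequency of the event.

```python
import math
import random


def violation_frequency(delta: float, horizon: int, trials: int,
                        seed: int = 0) -> float:
    """Fraction of runs in which (n/2) * mean_n^2 >= log(1/delta) + log(n(n+1))
    for some n <= horizon, with standard Gaussian samples (mu = 0)."""
    rng = random.Random(seed)
    base = math.log(1.0 / delta)
    violations = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, horizon + 1):
            total += rng.gauss(0.0, 1.0)
            mean = total / n
            if 0.5 * n * mean ** 2 >= base + math.log(n * (n + 1)):
                violations += 1
                break
    return violations / trials
```

Running `violation_frequency(0.1, 200, 2000)` should return a value well below δ = 0.1, which illustrates why the sharper log log(n) bounds mentioned above are worthwhile in practice.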

PROPOSITION 33.9. Let g : ℕ → ℝ be increasing, and for each i ∈ [k], let
S_{i1}, S_{i2}, ... be an infinite sequence of random variables such that for all δ ∈ (0, 1),

\[
\mathbb{P}\left( \text{exists } s \in \mathbb{N} : S_{is} \ge g(s) + \log(1/\delta) \right) \le \delta .
\]

Then, provided that (S_i)_{i=1}^k are independent and x ≥ 0,

\[
\mathbb{P}\left( \text{exists } s \in \mathbb{N}^k : \sum_{i=1}^k S_{i s_i} \ge k g\left( \sum_{i=1}^k s_i \right) + x \right)
\le \left( \frac{x}{k} \right)^k \exp(k - x) .
\]


Proof For i ∈ [k], let W_i = max{w ∈ [0, 1] : S_{is} ≤ g(s) + log(1/w) for all s ∈ ℕ},
where we define log(1/0) = ∞. Note that W_i are well defined. Then, for any
s ∈ ℕ^k,

\[
\sum_{i=1}^k S_{i s_i} \le \sum_{i=1}^k g(s_i) + \sum_{i=1}^k \log(1/W_i)
\le k g\left( \sum_{i=1}^k s_i \right) + \sum_{i=1}^k \log(1/W_i) .
\]

By assumption, (W_i)_{i=1}^k are independent and satisfy P(W_i ≤ x) ≤ x for all
x ∈ [0, 1]. The proof is completed using the result of Exercise 5.16.


Best-Arm Identification with a Budget 


In the fixed-budget variant of best-arm identification, the learner is given the
horizon n and should choose a policy π = (π_t)_{t=1}^{n+1} with the objective of minimising
the probability that A_{n+1} is suboptimal. The constraint on the horizon rather
than the confidence level makes this setting a bit more nuanced than the
fixed-confidence setting, and the results are not as clean.

A naive option is to use the uniform exploration policy, but as discussed in 
Section 33.1, this approach leads to poor results when the suboptimality gaps 
are not similar to each other. To overcome this problem, the sequential halving 
algorithm divides the budget into L = ⌈log₂(k)⌉ phases. In the first phase, the
algorithm chooses each arm equally often. The bottom half of the arms are then 
eliminated, and the process is repeated. 


THEOREM 33.10. If ν ∈ ℰ^k_SG(1) has mean vector μ = μ(ν) and μ_1 ≥ ⋯ ≥ μ_k and
π is sequential halving, then

\[
\mathbb{P}_{\nu\pi}(\Delta_{A_{n+1}} > 0) \le 3 \log_2(k) \exp\left( - \frac{n}{16 H_2(\mu) \log_2(k)} \right) ,
\]

where H₂(μ) = max_{i : Δ_i > 0} i/Δ_i².


The assumption on the ordering of the means is only needed for the clean 
definition of H2, which would otherwise be defined by permuting the arms. The 
algorithm is completely symmetric. In Exercise 33.8 we guide you through the 
proof of Theorem 33.10. 

Input n and k
Set L = ⌈log₂(k)⌉ and A_1 = [k]
for ℓ = 1, ..., L do
    Let T_ℓ = ⌊n / (L |A_ℓ|)⌋
    Choose each arm in A_ℓ exactly T_ℓ times
    For each i ∈ A_ℓ compute μ̂_i^ℓ as the empirical mean of arm i based on the
    last T_ℓ samples
    Let A_{ℓ+1} contain the top ⌈|A_ℓ|/2⌉ arms in A_ℓ
end for
return A_{n+1} as the arm in A_{L+1}


Algorithm 22: Sequential halving.


The quantity H₂(μ) looks a bit unusual, but arises naturally in the
analysis. It is related to a more familiar quantity as follows. Define H₁(μ) =
Σ_{i=1}^k min{1/Δ_i², 1/Δ²_min}. Then

\[
H_2(\mu) \le H_1(\mu) \le (1 + \log(k)) H_2(\mu) . \tag{33.9}
\]


Furthermore, both inequalities are essentially tight (Exercise 33.7). Let’s see how 
the bound in Theorem 33.10 compares to uniform exploration, which is the same 
as Algorithm 20. Like in the proof of Theorem 33.1, the probability that uniform 
exploration selects a suboptimal arm is easily controlled using Theorem 5.3 and 
Lemma 5.4: 


\[
\mathbb{P}_{\nu,\mathrm{UE}}(\Delta_{A_{n+1}} > 0)
\le \sum_{i : \Delta_i > 0} \mathbb{P}(\hat\mu_i(n) \ge \hat\mu_1(n))
\le \sum_{i : \Delta_i > 0} \exp\left( - \frac{\lfloor n/k \rfloor \Delta_i^2}{4} \right) .
\]

Suppose that Δ = Δ_2 = Δ_k, so that all suboptimal arms have the same
suboptimality gap. Then H₂ = k/Δ² and the terms in the exponent for sequential
halving and uniform exploration are Θ(nΔ²/(k log k)) and Θ(nΔ²/k), respectively,
which means that uniform exploration is actually moderately better than
sequential halving, at least if n is sufficiently large. On the other hand, if Δ_2 = Δ
is small, but Δ_i = 1 for all i > 2, then H₂ = O(1/Δ²) and the exponents are
Θ(nΔ²) and Θ(nΔ²/k), respectively, and sequential halving is significantly better.
The reason for the disparity is the non-adaptivity of uniform exploration, which
wastes many samples on arms i > 2. Although there are no asymptotically
matching upper and lower bounds in the fixed-budget setting, the bound of
sequential halving is known to be roughly optimal.
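
Algorithm 22 translates almost line by line into code. The following is a minimal sketch: the reward oracle `pull`, the helper `make_gaussian_arms` and the zero-based arm indices are assumptions made for illustration, and arms that receive no samples in a phase (which happens only when the budget is very small) are arbitrarily assigned an empirical mean of zero.

```python
import random
from typing import Callable, Sequence


def sequential_halving(n: int, k: int, pull: Callable[[int], float]) -> int:
    """Run sequential halving with budget n on arms 0, ..., k - 1.

    pull(i) returns one reward sample from arm i.
    """
    num_phases = max(1, (k - 1).bit_length())   # ceil(log2(k)) for k >= 2
    active = list(range(k))
    for _ in range(num_phases):
        pulls_per_arm = n // (num_phases * len(active))
        means = {}
        for i in active:
            rewards = [pull(i) for _ in range(pulls_per_arm)]
            means[i] = sum(rewards) / len(rewards) if rewards else 0.0
        keep = (len(active) + 1) // 2            # ceil(|A| / 2)
        active = sorted(active, key=lambda i: means[i], reverse=True)[:keep]
    return active[0]


def make_gaussian_arms(mu: Sequence[float], seed: int = 0) -> Callable[[int], float]:
    """Hypothetical test environment: unit-variance Gaussian rewards."""
    rng = random.Random(seed)
    return lambda i: rng.gauss(mu[i], 1.0)
```

A call such as `sequential_halving(4000, 4, make_gaussian_arms([1.0, 0.0, 0.0, 0.0]))` uses at most n samples in total, because each phase spends at most n/L pulls.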


Notes 


1 The problems studied in this chapter belong to the literature on stochastic 
optimisation, where the simple regret is called the expected suboptimality. 
There are many variants of pure exploration. In the example at the start of the 
chapter, a medical researcher may be interested in getting the most reliable 
information about differences between treatments. This falls into the class of
pure information-seeking problems, the subject of optimal experimental design 
from statistics, which we have met earlier. 

2 We mentioned that algorithms with logarithmic cumulative regret are not well
suited for pure exploration. Suppose π has asymptotically optimal cumulative
regret on ℰ = ℰ^k_N(1), which means that lim_{n→∞} E_{νπ}[T_i(n)]/log(n) = 2/Δ_i(ν)² for
all ν ∈ ℰ. You will show in Exercise 33.5 that for any ε > 0, there exists a
ν ∈ ℰ with a unique optimal arm such that

\[
\liminf_{n \to \infty} \frac{-\log(\mathbb{P}_{\nu\pi}(A_{n+1} \notin i^*(\nu)))}{\log(n)} \le 1 + \varepsilon .
\]


This shows that using an asymptotically optimal policy for cumulative regret 
minimisation leads to a best-arm identification policy for which the probability 
of selecting a suboptimal arm decays only polynomially with n. This result 
holds no matter how A_{n+1} is selected.

3 A related observation is that the empirical estimates of the means after running
an algorithm designed for minimising the cumulative regret tend to be negatively 
biased. This occurs because these algorithms play arms until their empirical 
means are sufficiently small. 

4 Although there is no exploration/exploitation dilemma in the pure exploration
setting, there is still an ‘exploration dilemma’ in the sense that the optimal 
exploration policy depends on an unknown quantity. This means the policy 
must balance (to some extent) the number of samples dedicated to learning 
how to explore relative to those actually exploring. 

5 Best-arm identification is a popular topic that lends itself to simple analysis
and algorithms. The focus on the correct identification of an optimal arm
makes us question the practicality of the setting, however. In reality, any
suboptimal arm is acceptable provided its suboptimality gap is small enough
relative to the budget, which is more faithfully captured by the simple
regret criterion. Of course the simple regret may be bounded naively by
R_n^simple ≤ max_i Δ_i P(Δ_{A_{n+1}} > 0), which is tight in some circumstances and
loose in others.

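6 An equivalent form of the bound shown in Theorem 33.5 is

\[
\mathbb{E}_{\nu\pi}[\tau] \ge \min\left\{ \sum_{i=1}^k \alpha_i : \alpha_1, \ldots, \alpha_k \ge 0, \;
\inf_{\nu' \in \mathcal{E}_{\mathrm{alt}}(\nu)} \sum_{i=1}^k \alpha_i D(\nu_i, \nu'_i) \ge \log\left(\frac{1}{4\delta}\right) \right\} .
\]

This form follows immediately from Eq. (33.6) by noting that E_{νπ}[τ] =
Σ_i E_{νπ}[T_i(τ)]. The version given in the theorem is preferred because it is
a closed form expression. Exercise 33.3 asks you to explore the relation between
the two forms.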

7 The forced exploration in the track-and-stop algorithm is sufficient for
asymptotic optimality. We are uneasy about the fact that the proof would work
for any threshold Ct^p with p ∈ (0, 1). There is nothing fundamental about √t.
We do not currently know of a principled way to tune the amount of forced
exploration or if there is a better algorithm design for best-arm identification.
Ideally one should provide finite-time upper bounds that match the finite-time
lower bound provided by Theorem 33.5. The extent to which this is possible
appears to be an open question.

8 The choice of β_t(δ) significantly influences the practical performance of
track-and-stop. We believe the analysis given here is mostly tight except that the
naive concentration bound given in Lemma 33.8 can be improved using a 
finite-time version of the law of the iterated logarithm (see Exercise 20.9, for 
example). 


9 Perhaps the most practical set-up in pure exploration has not yet received any
attention, which is upper and lower instance-dependent bounds on the simple
regret. Even better would be to have an understanding of the distribution of
Δ_{A_{n+1}}.

In the machine learning literature, pure exploration for bandits seems to have
been first studied by Even-Dar et al. [2002], Mannor and Tsitsiklis [2004] and
Even-Dar et al. [2006] in the ‘Probably Approximately Correct’ setting, where
the objective is to find an ε-optimal arm with high probability with as few samples
as possible. After a dry spell, the field was restarted by Bubeck et al. [2009] and 
Audibert and Bubeck [2010b]. The asymptotically optimal algorithm for the fixed 
confidence setting of Section 33.2 was introduced by Garivier and Kaufmann 
[2016], who also provide results for exponential families as well as in-depth 
intuition and historical background. Degenne and Koolen [2019] and Degenne 
et al. [2019] have injected some new ideas into the basic principles of track-and- 
stop by incorporating a kind of optimism and solving the optimisation problem 
incrementally using online learning, which leads to theoretical and practical 
improvements. A similar problem is studied in a Bayesian setting by Russo 
[2016], who focuses on designing algorithms for which the posterior probability of 
choosing a suboptimal arm converges to zero exponentially fast with an optimal 
rate. Even more recently, Qin et al. [2017] designed a policy that is optimal in both 
the frequentist and Bayesian settings. The stopping rule used by Garivier and 
Kaufmann [2016] is inspired by similar rules by Chernoff [1959]. The sequential 
halving algorithm is by Karnin et al. [2013], and the best summary of lower 
bounds is by Carpentier and Locatelli [2016]. Besides this there have been many 
other approaches, with a summary by Jamieson and Nowak [2014]. The negative 
result discussed in Note 2 is due to Bubeck et al. [2009]. Pure exploration has 
recently become a hot topic and is expanding beyond the finite-armed case, for
example to linear bandits [Soare et al., 2014], continuous-armed bandits
[Valko et al., 2013a], tree search [Garivier et al., 2016a, Huang et al., 2017a] and
combinatorial bandits [Chen et al., 2014, Huang et al., 2018].

The continuous-armed case is also known as zeroth-order (or derivative-free)
stochastic optimisation and is studied under various assumptions on
the unknown reward function, usually assuming that 𝒜 ⊂ ℝ^d. Because of the
obvious connection to optimisation, this literature usually considers losses, or 
cost, rather than reward, and the reward function is then called the objective 
function. A big part of this literature poses only weak assumptions, such as 
smoothness, on the objective function. Note that in the continuous-armed case, 
regret minimisation may only be marginally more difficult than minimising the 
simple regret because even the instance-dependent simple regret can decay at 
a slow, polynomial rate. While the literature is vast, most of it is focused on 
heuristic methods without rigorous finite-time analysis. Methods developed for 
this case maintain an approximation to the unknown objective and often use 
branch-and-bound techniques to focus the search for the optimal value. For a 
taster of the algorithmic ideas, see [Conn et al., 2009, Rios and Sahinidis, 2013]. 
When the search for the optimum is organised cleverly, the methods can adapt to 
‘local smoothness’ and enjoy various optimality guarantees [Valko et al., 2013a]. 
A huge portion of this literature considers the easier problem of finding a local 
minimiser, or just a stationary point. Another large portion of this literature 
is concerned with the case when the objective function is convex. Chapter 9 of 
the classic book by Nemirovsky and Yudin [1983] describes two complementary 
approaches (a geometric, and an analytic) and sketches their analysis. For the 
class of strongly convex and smooth functions, it is known that the minimax
simple regret is Θ(√(d²/n)) [Shamir, 2013]. The main outstanding challenge is to
understand the dependence of simple regret on the dimension beyond the strongly
convex and smooth case. Hu et al. [2016] prove a lower bound of Ω(n^{−1/3}) on
the simple regret for algorithms that construct gradient estimates by injecting
random noise (as is done by Katkovnik and Kulchitsky [1972], Nemirovsky and
Yudin [1983] and others), which, together with the O(n^{−1/2}) upper bound by
Nemirovsky and Yudin [1983] (see also Agarwal et al. 2013, Liang et al. 2014),
establishes the inferiority of this approach in the n ≫ d regime. Interestingly,
empirical evidence favours these gradient-based techniques in comparison to the 
‘optimal algorithms’. Thus, much room remains to improve our understanding of 
this problem. This setting is to be contrasted to the one when unbiased noisy 
estimates of the gradient are available where methods such as mirror descent 
(see Chapter 28) give optimal rates. This is a much better understood problem 
with matching lower and upper bounds available on the minimax simple regret 
for various settings (for example, Chapter 5 of Nemirovsky and Yudin [1983], or 
Rakhlin et al. [2012]). 

Variants of the pure exploration problem are studied in a branch of statistics 
called ranking and selection. The earliest literature on ranking and selection 
goes back to at least the 1950s. A relatively recent paper that gives a glimpse 
into a small corner of this literature is by Chan and Lai [2006]. The reason we 
cite this paper is because it is particularly relevant for this chapter. Using our 
terminology, Chan and Lai consider the PAC setting in the parametric setting 
when the distributions underlying the arms belong to some known exponential 
family of distributions. A procedure that is similar to the track-and-stop procedure
considered here is shown to be both sound and asymptotically optimal as the 
confidence parameter approaches one. We also like the short and readable review 
of the literature up to the 1980s from the perspective of simulation optimisation 
by Goldsman [1983]. 

A related setting studied mostly in the operations research community is 
ordinal optimisation. In its simplest form, ordinal optimisation is concerned 
with finding an arm amongst the αk arms with the highest pay-offs. Ho et al. [1992],
who defined this problem in the stochastic simulation optimisation literature, 
emphasised that the probability of failing to find one of the ‘good arms’ decays 
exponentially with the number of observations n per arm, in contrast to the 
slow n^{−1/2} decay of the error of estimating the value of the best arm, which
this literature calls the problem of cardinal optimisation. Given the results 
in this chapter, this should not be too surprising. A nice twist in this literature 
is that the error probability does not need to depend on k (see Exercise 33.9). 
The price, of course, is that the simple regret is in general uncontrolled. In a 
way, ordinal optimisation is a natural generalisation of best-arm identification. 
As such, it also leads to algorithmic choices that are not the best fit when the 
actual goal is to keep the simple regret small. Based on a Bayesian reasoning, a 
heuristic expression for the asymptotically optimal allocation of samples for the 
Gaussian best-arm identification problem is given by Chen et al. [2000]. They 
call the problem of finding an optimal allocation the ‘optimal computing budget 
allocation’ (OCBA) problem. Their work can be viewed as the precursor to the 
results in Section 33.2. Glynn and Juneja [2015] give further pointers to this
literature, while connecting it to the bandit literature. 

Best-arm identification has also been considered in the adversarial setting 
[Jamieson and Talwalkar, 2016, Li et al., 2018, Abbasi-Yadkori et al., 2018]. 
Another related setting is called the max-armed bandit problem, where the 
objective is to obtain the largest possible single reward over n rounds [Cicirello 
and Smith, 2005, Streeter and Smith, 2006a,b, Carpentier and Valko, 2014, Achab 
et al., 2017]. 


Exercises 


33.1 (SIMPLE REGRET LOWER BOUND) Show there exists a universal constant
C > 0 such that for all n ≥ k > 1 and all policies π, there exists a ν ∈ ℰ^k_N(1) with
Δ(ν) ∈ [0, 1]^k such that R_n^simple(π, ν) ≥ C √(k/n).


33.2 (SUBOPTIMALITY OF UNIFORM EXPLORATION) Show there exists a universal
constant C > 0 such that for all n ≥ k > 1, there exists a ν ∈ ℰ^k_N(1) with
Δ(ν) ∈ [0, 1]^k such that R_n^simple(UE, ν) ≥ C √(k log(k)/n).

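33.3 Let L > 0 and D ⊂ [0, ∞)^k \ {0} be non-empty. Show that

\[
\inf\left\{ \|\alpha\|_1 : \alpha \in [0, \infty)^k, \; \inf_{d \in D} \langle \alpha, d \rangle \ge L \right\}
= \left( \sup_{\alpha \in \mathcal{P}_{k-1}} \inf_{d \in D} \langle \alpha, d \rangle \right)^{-1} L .
\]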

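33.4 (BEST-ARM IDENTIFICATION FOR GAUSSIAN BANDITS) Let σ_1², ..., σ_k² be
fixed and ℰ = {(N(μ_i, σ_i²))_{i=1}^k : μ ∈ ℝ^k} be the set of Gaussian bandits with given
variances. Let ν ∈ ℰ be a bandit with μ_1(ν) > μ_i(ν) for all i > 1. Abbreviate
μ = μ(ν) and Δ = Δ(ν).


(a) For any α ∈ [0, ∞)^k show that

\[
\inf_{\nu' \in \mathcal{E}_{\mathrm{alt}}(\nu)} \sum_{i=1}^k \alpha_i D(\nu_i, \nu'_i)
= \frac{1}{2} \min_{i > 1} \frac{\alpha_1 \alpha_i \Delta_i^2}{\alpha_1 \sigma_i^2 + \alpha_i \sigma_1^2} .
\]

(b) Show that if k = 2, then c*(ν) = 2(σ_1 + σ_2)²/Δ_2².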


20? E 20 
$ 1 
(c) Show that c*(v) > eee + J : 


(da) Show that 


2 
20i 


(33.10) 


(e) Show that if σ_i²/Δ_i² = σ_1²/Δ²_min for all i, then equality holds in Eq. (33.10).


33.5 (SUBOPTIMALITY OF CUMULATIVE REGRET ALGORITHMS FOR BEST-ARM
IDENTIFICATION) Suppose π is an asymptotically optimal bandit policy in ℰ = ℰ^k_N(1)
in the sense that

\[
\lim_{n \to \infty} \frac{R_n(\pi, \nu)}{\log(n)} = \sum_{i : \Delta_i(\nu) > 0} \frac{2}{\Delta_i(\nu)}
\quad \text{for all } \nu \in \mathcal{E} .
\]

(a) For any ε > 0, prove there exists a ν ∈ ℰ with a unique optimal arm such
that

\[
\liminf_{n \to \infty} \frac{-\log(\mathbb{P}_{\nu\pi}(\Delta_{A_{n+1}} > 0))}{\log(n)} \le 1 + \varepsilon .
\]

(b) Can you prove the same result with lim inf replaced by lim sup?
(c) What happens if the assumption that π is asymptotically optimal is replaced
with the assumption that there exists a universal constant C > 0 such that

\[
R_n(\pi, \nu) \le C \sum_{i : \Delta_i(\nu) > 0} \left( \Delta_i(\nu) + \frac{\log(n)}{\Delta_i(\nu)} \right) ?
\]


33.6 (ANALYSIS OF TRACK-AND-STOP) In this exercise, you will complete the
proof of Theorem 33.6. Assume that ν has a unique optimal arm. Make ℰ a
metric space via the metric d(ν_1, ν_2) = ‖μ(ν_1) − μ(ν_2)‖_∞. Let ε > 0 be a small
constant, and define random times

\[
\begin{aligned}
\tau_\nu(\varepsilon) &= 1 + \max\{t : d(\hat\nu(t), \nu) \ge \varepsilon\} \\
\tau_\alpha(\varepsilon) &= 1 + \max\{t : \|\alpha^*(\nu) - \alpha^*(\hat\nu(t))\|_\infty \ge \varepsilon\} \\
\tau_T(\varepsilon) &= 1 + \max\{t : \|T(t)/t - \alpha^*(\nu)\|_\infty \ge \varepsilon\} .
\end{aligned}
\]


Note, these are not stopping times. Do the following: 


(a) Show that α*(ν) is unique.

(b) Show α* is continuous at ν.

(c) Prove that E[τ_ν(ε)] < ∞ for all ε > 0.
(d) Prove that E[τ_α(ε)] < ∞ for all ε > 0.
(e) Prove that E[τ_T(ε)] < ∞ for all ε > 0.
(f) Prove that lim_{δ→0} E[τ]/log(1/δ) ≤ c*(ν).


33.7 (COMPLEXITY MEASURE COMPARISON) Prove the following: 


(a) Let L = 1 + Σ_{i=3}^k 1/i and show that H₂(μ) ≤ H₁(μ) ≤ L H₂(μ). Combine this
with the fact that L ≤ 1 + log(k) to prove that Eq. (33.9) holds.

(b) Find μ and μ' such that H₂(μ) = H₁(μ) and H₁(μ') = L H₂(μ'). Conclude
the inequalities in Eq. (33.9) are tight.
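
Before attempting the proof, it can help to evaluate both complexity measures numerically. The sketch below computes H₁, H₂ and the constant L of part (a) for a given mean vector; the function name `complexity_measures` is illustrative, and the arms are sorted internally so that the input need not be ordered. It is a numerical check, not a proof.

```python
import math
from typing import Sequence, Tuple


def complexity_measures(mu: Sequence[float]) -> Tuple[float, float, float]:
    """Return (H1, H2, L) for the mean vector mu.

    Gaps are sorted in increasing order, so gaps[i - 1] is the gap of the
    i-th best arm. Requires at least one suboptimal arm.
    """
    best = max(mu)
    gaps = sorted(best - m for m in mu)
    positive = [g for g in gaps if g > 0]
    if not positive:
        raise ValueError("all arms are optimal")
    delta_min = positive[0]
    # min{1/gap^2, 1/delta_min^2} equals 1 / max(gap, delta_min)^2.
    h1 = sum(1.0 / max(g, delta_min) ** 2 for g in gaps)
    h2 = max(i / g ** 2 for i, g in enumerate(gaps, start=1) if g > 0)
    ell = 1.0 + sum(1.0 / i for i in range(3, len(mu) + 1))
    return h1, h2, ell
```

For μ = (1, 0.5, 0.5, 0), the gaps are (0, 0.5, 0.5, 1), so H₁ = 4 + 4 + 4 + 1 = 13 and H₂ = max{8, 12, 4} = 12, which sits between the two sides of Eq. (33.9).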


33.8 (ANALYSIS OF SEQUENTIAL HALVING) The purpose of this exercise is to
prove Theorem 33.10. Assume without loss of generality that μ = μ(ν) satisfies
μ_1 ≥ μ_2 ≥ ... ≥ μ_k. Given a set A ⊂ [k], let

\[
\mathrm{TopM}(A, m) = \left\{ i \in [k] : \sum_{j \le i} \mathbb{I}\{j \in A\} \le m \right\}
\]

be the top m arms in A. To make life easier, you may also assume that k is a
power of two so that |A_ℓ| = k 2^{1−ℓ} and T_ℓ = n 2^{ℓ−1}/(k log₂(k)).


(a) Prove that |A_{L+1}| = 1.
(b) Let i be a suboptimal arm in A_ℓ, and suppose that 1 ∈ A_ℓ. Show that

\[
\mathbb{P}\left( \hat\mu_1^\ell \le \hat\mu_i^\ell \,\middle|\, i \in A_\ell, 1 \in A_\ell \right) \le \exp\left( - \frac{T_\ell \Delta_i^2}{4} \right) .
\]

(c) Let A'_ℓ = A_ℓ \ TopM(A_ℓ, ⌈|A_ℓ|/4⌉) be the bottom three-quarters of the arms
in round ℓ. Show that if the optimal arm is eliminated after the ℓth phase,
then

\[
N_\ell = \sum_{i \in A'_\ell} \mathbb{I}\{\hat\mu_i^\ell \ge \hat\mu_1^\ell\} \ge \frac{1}{3} |A'_\ell| .
\]

(d) Let i_ℓ = min A'_ℓ and show that

\[
\mathbb{E}[N_\ell \mid 1 \in A_\ell]
\le |A'_\ell| \max_{i \in A'_\ell} \exp\left( - \frac{\Delta_i^2 n 2^{\ell-1}}{4 k \log_2(k)} \right)
\le |A'_\ell| \exp\left( - \frac{n \Delta_{i_\ell}^2}{16 i_\ell \log_2(k)} \right) .
\]

(e) Combine the previous two parts with Markov’s inequality to show that

\[
\mathbb{P}(1 \notin A_{\ell+1} \mid 1 \in A_\ell) \le 3 \exp\left( - \frac{n \Delta_{i_\ell}^2}{16 \log_2(k) \, i_\ell} \right) .
\]

(f) Join the dots to prove Theorem 33.10.


33.9 Let P be a distribution over the measurable space 𝒳, μ : 𝒳 → [0, 1] be
measurable, α, δ ∈ (0, 1), and define μ*_α = inf{y : P(μ(X) < y) ≥ 1 − α}. Show
that if n ≥ log(1/δ)/log(1/(1 − α)), then for X_1, ..., X_n ∼ P independent, with
probability at least 1 − δ, it holds that max_{i∈[n]} μ(X_i) ≥ μ*_α.


33.10 (MULTIPLE OPTIMAL ARMS AND SOUNDNESS) Throughout this exercise, 
let k > 1. 


(a) Let ℰ = ℰ^k_N(1). Prove that for any sound pair (π, τ) and ν ∈ ℰ with
|i*(ν)| > 1, it holds that P_{νπ}(τ = ∞) = 1.

(b) Repeat the previous part with ℰ = ℰ^k_B.

(c) Describe an unstructured class of k-armed stochastic bandits ℰ and ν ∈ ℰ
with |i*(ν)| > 1 and sound pair (π, τ) for which P_{νπ}(τ = ∞) = 0.


33.11 (PROBABLY APPROXIMATELY CORRECT ALGORITHMS) This exercise is
about designing (ε, δ)-PAC algorithms.


(a) For each ε > 0 and δ ∈ (0, 1) and number of arms k > 1, design a policy π
and stopping time τ such that for all ν ∈ ℰ,

\[
\mathbb{P}_{\nu\pi}(\Delta_{A_\tau} \ge \varepsilon) \le \delta
\quad \text{and} \quad
\mathbb{E}_{\nu\pi}[\tau] \le \frac{Ck}{\varepsilon^2} \log\left( \frac{k}{\delta} \right) ,
\]

for universal constant C > 0.
(b) It turns out the logarithmic dependence on k can be eliminated. Design a
policy π and stopping time τ such that for all ν ∈ ℰ,

\[
\mathbb{P}_{\nu\pi}(\Delta_{A_\tau} \ge \varepsilon) \le \delta
\quad \text{and} \quad
\mathbb{E}_{\nu\pi}[\tau] \le \frac{Ck}{\varepsilon^2} \log\left( \frac{1}{\delta} \right) .
\]

(c) Prove a lower bound showing that the bound in part (b) is tight up to
constant factors in the worst case.


HINT Part (b) of the above exercise is a challenging problem. The simplest 
approach is to use an elimination algorithm that operates in phases where at 
the end of each phase, the bottom half of the arms (in terms of their empirical 
estimates) are eliminated. For details, see the paper by Even-Dar et al. [2002]. 

Foundations of Bayesian Learning


Bayesian methods have been used for bandits from the beginning of the field and 
dominated research from 1950 until 1980. This chapter introduces the Bayesian 
viewpoint and develops the technical tools necessary for applications in bandits. 
Readers who are already familiar with measure-theoretic Bayesian analysis
can skim Sections 34.4 and 34.6 for the notation used in subsequent chapters. 


Statistical Decision Theory and Bayesian Learning 


The fundamental challenge in learning problems is that the true environment 
is unknown and policies that are optimal in one environment are usually not 
optimal in another. This forces the user to make trade-offs, balancing performance 
between environments. We have already discussed this in the context of finite- 
armed bandits in Part IV. Here we take a step back and consider a more general 
set-up. 

Let ℰ be a set of environments and Π a set of policies. These could be bandit
environments/policies, but for now an abstract view is sufficient. A loss function
is a mapping ℓ : ℰ × Π → ℝ with ℓ(ν, π) representing the loss suffered by policy π
in environment ν. Of course you should choose a policy that makes the loss small,
but most choices are incomparable because the loss depends on the environment.
Fig. 34.1 illustrates a typical situation with four policies. Some policies can
be eliminated from consideration because they are dominated, which means
they suffer at least as much loss as some other policy on all environments and
more loss on at least one. A policy that is not dominated is called admissible
or Pareto optimal. Choosing between admissible policies is non-trivial. One
canonical choice of admissible policy (assuming it exists) is a minimax optimal
policy π ∈ argmin_{π'} sup_ν ℓ(ν, π'). Minimax optimal policies enjoy robustness, but
the price may be quite large on average. Would you choose the minimax optimal 
policy in the example in Fig. 34.1? 

Figure 34.1 Loss as a function of the environment for four different policies π_1, ..., π_4,
when ℰ = [0, 1]. The horizontal axis shows the environments, and the curves are labelled
π_1 (admissible), π_2 (dominated), π_3 (minimax optimal) and π_4 (admissible). Which
policy would you choose?

In the Bayesian viewpoint, the uncertainty in the environment is captured by
choosing a prior probability measure on ℰ that reflects the user’s belief about the
environment the learner will face. Having committed to a prior, the Bayesian
optimal policy simply minimises the expected loss with respect to the prior.
When ℰ is countable, a measure corresponds to a probability vector q ∈ P(ℰ),
and the Bayesian optimal policy with respect to q is an element of

\[
\operatorname{argmin}_{\pi} \sum_{\nu \in \mathcal{E}} q(\nu) \, \ell(\nu, \pi) .
\]
The Bayesian viewpoint is hard to criticise when the user really does know the 
underlying likelihood of each environment and the user is risk-neutral. Even 
when the distribution is not known exactly, however, sensible priors often yield 
provably sensible outcomes, regardless of whether one is interested in the average 
loss across the environments, or the worst-case loss, or some other metric. 
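
For a finite set of environments, the notions of dominated, minimax optimal and Bayesian optimal policies can be checked by direct enumeration. The following sketch uses a small loss table whose numbers are invented purely for illustration (they are not read off Fig. 34.1), and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical losses: one row per policy, one column per environment.
LOSSES: Dict[str, Sequence[float]] = {
    "pi1": [0.0, 0.5, 1.0],
    "pi2": [0.2, 0.7, 1.0],
    "pi3": [0.6, 0.6, 0.6],
    "pi4": [1.0, 0.5, 0.1],
}


def bayes_optimal(losses: Dict[str, Sequence[float]],
                  prior: Sequence[float]) -> str:
    """Policy minimising the prior-weighted expected loss."""
    return min(losses, key=lambda p: sum(q * l for q, l in zip(prior, losses[p])))


def minimax_optimal(losses: Dict[str, Sequence[float]]) -> str:
    """Policy minimising the worst-case loss over environments."""
    return min(losses, key=lambda p: max(losses[p]))


def dominated(losses: Dict[str, Sequence[float]]) -> List[str]:
    """Policies that are no better than some other policy everywhere
    and strictly worse in at least one environment."""
    def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in losses
            if any(dominates(losses[o], losses[p]) for o in losses if o != p)]
```

With the prior (0.5, 0.3, 0.2), the expected losses are 0.35, 0.51, 0.6 and 0.67, so the Bayesian optimal policy is `pi1`, while the minimax optimal policy is `pi3` and only `pi2` is dominated. The example shows how the choice among admissible policies depends on the criterion.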


A distinction is often made between the Bayesian and frequentist viewpoints, 
which naturally leads to heated discussions on the merits of one viewpoint 
relative to another. This debate does not interest us greatly. We prefer to 
think about the pros and cons of problem definitions and solution methods, 
regardless of the label on them. Bayesian approaches to bandits have their 
strengths and weaknesses, and we hope to do them a modicum of justice 
here. 


Bayesian Learning and the Posterior Distribution 


The last section explained the ‘forward view’, where a policy is chosen in advance 
that minimises the expected loss. The Bayesian can also act sequentially by 
updating their beliefs (the prior) as data is observed to obtain a new distribution 
on the set of environments (more generally, the set of hypotheses). The new 
distribution is called the posterior. This is simple and well defined when the
environment set is countable, but quickly gets technical for larger spaces. We
start gently with a finite case and then explain the measure-theoretic machinery
needed to rigorously treat the general case.

Suppose you are given a bag containing two marbles. A trustworthy source
tells you the bag contains either (a) two white marbles (WW) or (b) a white
marble and a black marble (WB). You are allowed to choose a marble from the
bag (without looking) and observe its colour, which we abbreviate by ‘observe
white’ (OW) or ‘observe black’ (OB). The question is how to update your ‘beliefs’
about the contents of the bag having observed one of the marbles. The Bayesian
way to tackle this problem starts by choosing a probability distribution on the
space of hypotheses, which, incidentally, is also called the prior. This distribution
usually reflects one’s beliefs about which hypotheses are more probable. In the
absence of extra knowledge, for the sake of symmetry, it seems reasonable to choose
P(WW) = 1/2 and P(WB) = 1/2. The next step is to think about the likelihood
of the possible outcomes under each hypothesis. Assuming that the marble is
selected blindly (without peeking into the bag) and the marbles in the bag are
well shuffled, these are

\[
\mathbb{P}(\mathrm{OW} \mid \mathrm{WW}) = 1 \quad \text{and} \quad \mathbb{P}(\mathrm{OW} \mid \mathrm{WB}) = 1/2 .
\]


The conditioning here indicates that we are including the hypotheses as part of 
the probability space, which is a distinguishing feature of the Bayesian approach. 
With this formulation we can apply Bayes’ law (Eq. (2.2)) to show that 


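\[
\mathbb{P}(\mathrm{WW} \mid \mathrm{OW})
= \frac{\mathbb{P}(\mathrm{OW} \mid \mathrm{WW}) \mathbb{P}(\mathrm{WW})}{\mathbb{P}(\mathrm{OW})}
= \frac{\mathbb{P}(\mathrm{OW} \mid \mathrm{WW}) \mathbb{P}(\mathrm{WW})}{\mathbb{P}(\mathrm{OW} \mid \mathrm{WW}) \mathbb{P}(\mathrm{WW}) + \mathbb{P}(\mathrm{OW} \mid \mathrm{WB}) \mathbb{P}(\mathrm{WB})}
= \frac{1 \times \frac{1}{2}}{1 \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2}}
= \frac{2}{3} .
\]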


Of course $P(\mathrm{WB} \mid \mathrm{OW}) = 1 - P(\mathrm{WW} \mid \mathrm{OW}) = 1/3$. Thus, while in the absence of 
observations, ‘a priori’, both hypotheses are equally likely, having observed a 
white marble, the probability that the bag originally contained two white marbles 
(and thus the bag has a white marble remaining in it) jumps to 2/3. An alternative 
calculation shows that $P(\mathrm{WW} \mid \mathrm{OB}) = 0$, which makes sense because choosing a 
black marble rules out the hypothesis that the bag contains two white marbles. 
The conditional distribution $P(\cdot \mid \mathrm{OW})$ over the hypotheses is called the posterior 
distribution and represents the Bayesian’s belief in each hypothesis after observing 
a white marble. 
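
The marble calculation can be checked mechanically. The following is a minimal sketch of Bayes' law on a finite hypothesis space, using exact rational arithmetic; the function name `posterior` and the dictionary layout are our own illustrative choices, not notation from the text.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Bayes' law on a finite hypothesis space.

    prior:      dict mapping hypothesis -> P(hypothesis)
    likelihood: dict mapping hypothesis -> P(observation | hypothesis)
    """
    # P(observation), obtained by summing over the hypotheses.
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    if evidence == 0:
        raise ValueError("the observation has probability zero under the prior")
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

prior = {"WW": Fraction(1, 2), "WB": Fraction(1, 2)}
likelihood_white = {"WW": Fraction(1), "WB": Fraction(1, 2)}  # P(OW | hypothesis)
likelihood_black = {"WW": Fraction(0), "WB": Fraction(1, 2)}  # P(OB | hypothesis)

post_white = posterior(prior, likelihood_white)
post_black = posterior(prior, likelihood_black)
print(post_white)
print(post_black)
```

The explicit check for zero evidence anticipates the difficulty discussed next: Bayes' law in this elementary form breaks down when the observation has probability zero.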


A Rigorous Treatment of Posterior Distributions 


A more sophisticated approach is necessary when the hypothesis and/or outcome 
spaces are not discrete. In introductory texts, the underlying details are often 
(quite reasonably) swept under the rug for the sake of clarity. Besides the desire 
for generality, there are two reasons not to do this. First, having spent the effort 
developing the necessary tools in Chapter 2, it would seem a waste not to use them 
now. And second, the subtle issues that arise highlight some real consequences of 
the differences between the Bayesian and frequentist viewpoints. As we shall see, 
there is a real gap between these viewpoints. 

Let $\Theta$ be a set called the hypothesis space and $\mathcal G$ be a $\sigma$-algebra on $\Theta$. While 
$\Theta$ is often a subset of a Euclidean space, we do not make this assumption. A prior 
is a probability measure $Q$ on $(\Theta, \mathcal G)$. Next, let $(\mathcal U, \mathcal H)$ be a measurable space and 
$P = (P_\theta : \theta \in \Theta)$ be a probability kernel from $(\Theta, \mathcal G)$ to $(\mathcal U, \mathcal H)$. We call $P$ the 
model. Let $\Omega = \Theta \times \mathcal U$ and $\mathcal F = \mathcal G \otimes \mathcal H$. The prior and the model combine to yield 
a probability measure $\mathbb P = Q \otimes P$ on $(\Omega, \mathcal F)$. The prior is now the marginal distribution 
of the joint probability measure: $Q(A) = \mathbb P(A \times \mathcal U)$. Suppose a random element 
$X$ on $\Omega$ describes what is observed. Then, generalizing the previous example 
with the marbles, the posterior should somehow be the marginal of the joint 
probability measure conditioned on $X$. To make this more precise, let $(\mathcal X, \mathcal J)$ be a 
measurable space and $X : \Omega \to \mathcal X$ an $\mathcal F/\mathcal J$-measurable map. The posterior having 
observed that $X = x$ should be a measure $Q(\cdot \mid x)$ on $(\Theta, \mathcal G)$. 


We abuse notation by letting $\theta : \Omega \to \Theta$ denote the $\mathcal F/\mathcal G$-measurable random 
element given by the projection $\theta((\phi, u)) = \phi$. This allows $\theta$ to be used as 
part of the probability expressions below. 


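Without much thought, we might try to apply Bayes’ law (Eq. (2.2)) to claim 
that the posterior distribution having observed $X(\omega) = x$ should be a measure 
on $(\Theta, \mathcal G)$ given by 

$$Q(A \mid x) = \mathbb P(\theta \in A \mid X = x) = \frac{\mathbb P(X = x \mid \theta \in A)\, \mathbb P(\theta \in A)}{\mathbb P(X = x)}. \tag{34.1}$$
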
The problem with the ‘definition’ in (34.1) is that $\mathbb P(X = x)$ can be zero, 
and then $\mathbb P(\theta \in A \mid X = x)$ is not defined. This is not an esoteric problem. 
Consider the problem when $\theta$ is randomly chosen from $\Theta = \mathbb R$ and its distribution 
is $Q = \mathcal N(0,1)$, and the parameter $\theta$ is observed in Gaussian noise with a variance of 
one: $\mathcal U = \mathbb R$, $P_\theta = \mathcal N(\theta, 1)$ for all $\theta \in \mathbb R$ and $X(\phi, u) = u$ for all $(\phi, u) \in \Theta \times \mathcal U$. Even 
in this very simple example, we have $\mathbb P(X = x) = 0$ for all $x \in \mathbb R$. Having read 
Chapter 2, the next attempt might be to define $Q(A \mid X)$ as a $\sigma(X)$-measurable 
random variable defined using conditional expectations: for $A \in \mathcal G$, 


$$Q(A \mid x) = \mathbb E[\mathbb I\{\theta \in A\} \mid X](x),$$


where we remind the reader that $\mathbb E[\mathbb I\{\theta \in A\} \mid X]$ is a $\sigma(X)$-measurable random 
variable that is uniquely defined except for a set of measure zero and also that 
the notation on the right-hand side is explained in Fig. 2.4 in Chapter 2. For 
most applications of probability theory, the choice of conditional expectation 
does not matter. However, as we shortly illustrate with an example, this is not 
true here. A related annoying issue is that $Q(\cdot \mid x)$ as defined above need not be 
a measure. By assuming that $(\Theta, \mathcal G)$ is a Borel space, this issue can be overcome 
by using a regular version (Theorem 3.11), a result that we restate here using 
the present notation. 


THEOREM 34.1. If $(\Theta, \mathcal G)$ is a Borel space, then there exists a probability kernel 
$Q : \mathcal X \times \mathcal G \to [0,1]$ such that $Q(A \mid X) = \mathbb P(\theta \in A \mid X)$ simultaneously for all $A \in \mathcal G$ 
outside of some $\mathbb P$-null set. Furthermore, for any two probability kernels $Q, Q'$ 
satisfying this condition, $Q(\cdot \mid x) = Q'(\cdot \mid x)$ for all $x$ in some set of $\mathbb P_X$-probability 
one. 


The Posterior Density 

Theorem 34.1 provides weak conditions under which a posterior exists but does 
not suggest a useful way of finding it. In many practical situations, the posterior 
can be calculated using densities. Given $\theta \in \Theta$ let $p_\theta$ be the Radon–Nikodym 
derivative of $P_\theta$ with respect to some measure $\mu$ and let $q(\theta)$ be the Radon–Nikodym 
derivative of $Q$ with respect to another measure $\nu$. Provided all terms 
are appropriately measurable and non-zero, then 

$$q(\theta \mid x) = \frac{p_\theta(x)\, q(\theta)}{\int_\Theta p_\psi(x)\, q(\psi)\, d\nu(\psi)} \tag{34.2}$$

is the Radon–Nikodym derivative of $Q(\cdot \mid x)$ with respect to $\nu$, also known 
as the posterior density of $Q$. In other words, for any $A \in \mathcal G$, it holds that 
$Q(A \mid x) = \int_A q(\theta \mid x)\, d\nu(\theta)$. This corresponds to the usual manipulation of 
densities when $\mu$ and $\nu$ are the Lebesgue measures. 

The reader may wonder why there was so much fuss about the existence of $Q(\cdot \mid x)$ 
in the previous section if we can get its density with a simple formula like (34.2). 
In other words, why not flip things around and define $Q(\cdot \mid x)$ via (34.2)? The 
crux of the problem is that it is often hard to come up with an appropriate 
dominating measure $\mu$, and in general the denominator on the right-hand side of 
(34.2) could be zero for some particular value of $x$. But when we can identify 
an appropriate measure $\mu$ and the denominators are non-zero, the above formula 
can indeed be used as the definition of $Q(\cdot \mid x)$ (Exercise 34.4). 


The Non-uniqueness Issue Frequentists Face 
A minor annoyance when using Bayesian methods as part of a frequentist argument 
is that the posterior need not be unique. 


EXAMPLE 34.2. Consider the situation when the hypothesis set is the $[0, 1]$ 
interval, the prior is the uniform distribution, and the observation is equal to the 
hypothesis sampled. Formally, $\Theta = [0,1]$ and the prior $Q$ is the uniform measure 
on $(\Theta, \mathfrak B(\Theta))$, $P_\theta = \delta_\theta$ is the Dirac measure on $[0,1]$ at $\theta$, and $X : [0,1] \to [0,1]$ is 
the identity: $X(x) = x$ for all $x \in [0,1]$. Let $C \subset [0,1]$ be an arbitrary countable 
set and $\mu$ be an arbitrary probability measure on $([0, 1], \mathfrak B([0,1]))$. It is not hard to 
see that the probability kernel 

$$Q(A \mid x) = \begin{cases} \delta_x(A), & \text{if } x \notin C; \\ \mu(A), & \text{if } x \in C \end{cases}$$

satisfies the conditions of Theorem 34.1 and is thus one of the many versions of 
the posterior, regardless of the choice of $C$ and $\mu$! 

A true Bayesian is unconcerned. If $\theta$ is sampled from the prior $Q$, then the 
event $\{X \in C\}$ has measure zero, and there is little cause to worry about events 
that happen with probability zero. But for a frequentist using Bayesian techniques 
for inference, this actually matters. If $\theta$ is not sampled from $Q$, then nothing 
prevents the situation that $\theta \in C$ and the non-uniqueness of the posterior is an 
issue (Exercise 34.12). Probability theory does not provide a way around this 
issue. 


It follows that one must be careful to specify the version of the posterior 
being used when using Bayesian techniques for inference in a frequentist 
setting because in the frequentist viewpoint, $\theta$ is not part of the probability 
space and results are proven for $P_\theta$ for arbitrary fixed $\theta \in \Theta$. By contrast, 
the all-in Bayesians include $\theta$ in the probability space and thus will not worry 
about events with negligible prior probability, and for them any version of 
the posterior will do. 


Although it is important to be aware of the non-uniqueness of the posterior, 
practically speaking it is hard to go wrong. In typical applications, there is a 
‘canonical’ choice. For example, in the Gaussian prior and model case studied 
below, it feels right to choose the posterior to be Gaussian. More generally, 
preferring posteriors with continuous densities with respect to the Lebesgue 
measure is generally a parsimonious choice. 


Conjugate Pairs, Conjugate Priors and the Exponential Family 


One of the strengths of the Bayesian approach is the ability to incorporate 
explicitly specified prior beliefs. This is philosophically attractive and can be 
enormously beneficial when the user has well-grounded prior knowledge about 
the problem. When it comes to Bayesian algorithms, however, this advantage 
is belied a little by the competing necessity of choosing a prior for which the 
posterior can be efficiently computed or sampled from. The ease of computing 
(or sampling from) the posterior depends on the interplay between the prior and 
the model. Given the importance of computation, it is hardly surprising that 
researchers have worked hard to find models and priors that behave well together. 
A prior and model are called a conjugate pair if the posterior has the same 
parametric form as the prior. In this case, the prior is called a conjugate prior 
to the model. 


Gaussian Model/Gaussian Prior 

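Suppose that $(\Theta, \mathcal G) = (\Omega, \mathcal F) = (\mathbb R, \mathfrak B(\mathbb R))$ and $X : \Omega \to \Omega$ is the identity and 
$P_\theta$ is Gaussian with mean $\theta$ and known signal variance $\sigma_S^2$. If the prior $Q$ is 
Gaussian with mean $\mu_P$ and prior variance $\sigma_P^2$, then the posterior distribution 
having observed $X = x$ can be chosen to be 

$$Q(\cdot \mid x) = \mathcal N\left( \frac{\mu_P/\sigma_P^2 + x/\sigma_S^2}{1/\sigma_P^2 + 1/\sigma_S^2},\ \left(\frac{1}{\sigma_P^2} + \frac{1}{\sigma_S^2}\right)^{-1} \right).$$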


The proof is left to the reader in Exercise 34.1. 
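
For readers who like to experiment, here is a small sketch of the Gaussian update in code. It relies on the standard fact that the posterior precision (inverse variance) is the sum of the prior and signal precisions, and that the posterior mean is the precision-weighted average of the prior mean and the observation. The function and argument names are our own.

```python
def gaussian_posterior(prior_mean, prior_var, signal_var, x):
    """Posterior (mean, variance) for a Gaussian prior on the mean of a
    Gaussian model with known variance, after observing a single x."""
    if prior_var <= 0 or signal_var <= 0:
        raise ValueError("variances must be positive")
    precision = 1.0 / prior_var + 1.0 / signal_var  # posterior precision
    mean = (prior_mean / prior_var + x / signal_var) / precision
    return mean, 1.0 / precision

# Equal prior and signal variance: the posterior mean lies halfway
# between the prior mean and the observation.
mean, var = gaussian_posterior(prior_mean=0.0, prior_var=1.0, signal_var=1.0, x=2.0)
print(mean, var)

# A very diffuse prior: the posterior mean is essentially the observation.
vague_mean, _ = gaussian_posterior(prior_mean=5.0, prior_var=1e12, signal_var=1.0, x=2.0)
print(vague_mean)
```

Trying extreme values of `prior_var` and `signal_var` is a quick way to build intuition for the limiting regimes.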


Following convention, from now on we sweep under the rug that this posterior 
is one of many choices, which is justified because all posteriors must agree 
almost everywhere. 


The limiting regimes as the prior/signal variance tend to zero or infinity are 
quite illuminating. For example, as $\sigma_P^2 \to 0$ the posterior tends to a Gaussian 
$\mathcal N(\mu_P, \sigma_P^2)$, which is equal to the prior and indicates that no learning occurs. 
This is consistent with intuition. If the prior variance is zero, then the statistician 
is already certain of the mean, and no amount of data can change their belief. 
On the other hand, as $\sigma_P^2$ tends to infinity, we see the mean of the posterior 
has no dependence on the prior mean, which means that all prior knowledge is 
washed away with just one sample. You should think about what happens when 
$\sigma_S^2 \to 0$ or $\sigma_S^2 \to \infty$. 

Notice how the model has fixed $\sigma_S^2$, suggesting that the model variance is 
known. The Bayesian can also incorporate their uncertainty over the variance. In 
this case, the model parameters are $\Theta = \mathbb R \times [0, \infty)$ and $P_\theta = \mathcal N(\theta_1, \theta_2)$. But is 
there a conjugate prior in this case? Already things are getting complicated, so we 
will simply let you know that the family of Gaussian-inverse-gamma distributions 
is conjugate. 


Bernoulli Model/Beta Prior 

Suppose that $\Theta = [0,1]$ and $P_\theta = \mathcal B(\theta)$ is Bernoulli with parameter $\theta$. In this 
case, it turns out that the family of beta distributions is conjugate, which for 
parameters $(\alpha, \beta) \in (0,\infty)^2$ is given in terms of its probability density 
function with respect to the Lebesgue measure: 

$$p_{\alpha,\beta}(x) = x^{\alpha-1} (1 - x)^{\beta-1}\, \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}, \tag{34.3}$$

where $\Gamma(x)$ is the Gamma function. Then the posterior having observed $X = x \in \{0,1\}$ 
is also a beta distribution with parameters $(\alpha + x, \beta + 1 - x)$. Unlike in 
the Gaussian case, the posterior for the Bernoulli model and beta prior is unique 
(Exercise 34.2). 
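
The beta update can be confirmed numerically from the general density formula (34.2): multiply the Bernoulli likelihood by the beta prior density and normalise. The sketch below does this with a midpoint rule for the normalising integral. The grid size and helper names are illustrative assumptions, and the quadrature is only appropriate when both beta parameters are at least one, so that the prior density is bounded.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1), as in Eq. (34.3)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

def bernoulli_pmf(x, theta):
    """Probability of x in {0, 1} under a Bernoulli with parameter theta."""
    return theta if x == 1 else 1.0 - theta

def posterior_density(theta, x, a, b, grid_size=20_000):
    """Posterior density via Eq. (34.2), with the denominator
    approximated by a midpoint rule on (0, 1)."""
    h = 1.0 / grid_size
    evidence = h * sum(
        bernoulli_pmf(x, (j + 0.5) * h) * beta_pdf((j + 0.5) * h, a, b)
        for j in range(grid_size)
    )
    return bernoulli_pmf(x, theta) * beta_pdf(theta, a, b) / evidence

a, b, x, theta = 2.0, 3.0, 1, 0.3
numeric = posterior_density(theta, x, a, b)
closed_form = beta_pdf(theta, a + x, b + 1 - x)
print(numeric, closed_form)
```

Up to quadrature error the two printed values agree, which is the conjugacy claim evaluated at one point.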


Exponential Families 


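Both the Gaussian and Bernoulli families are examples of a more general family. 
Let $h$ be a measure on $(\mathbb R, \mathfrak B(\mathbb R))$ and $S, \eta : \mathbb R \to \mathbb R$ be two ‘suitable’ functions, 
where $S$ is called the sufficient statistic. Together, $h$, $\eta$ and $S$ define a measure 
$P_\theta$ on $(\mathbb R, \mathfrak B(\mathbb R))$ for each $\theta \in \Theta \subseteq \mathbb R$ in terms of its density with respect to $h$: 

$$\frac{dP_\theta}{dh}(x) = \exp\left(\eta(\theta) S(x) - A(\theta)\right),$$

where $A(\theta) = \log \int_{\mathbb R} \exp(\eta(\theta) S(x))\, dh(x)$ is the log-partition function and 
$\Theta = \operatorname{dom}(A) = \{\theta : A(\theta) < \infty\}$ is the domain of $A$. Integrating the density 
shows that for any $B \in \mathfrak B(\mathbb R)$ and $\theta \in \Theta$, 

$$P_\theta(B) = \int_B \frac{dP_\theta}{dh}(x)\, dh(x) = \int_B \exp\left(\eta(\theta) S(x) - A(\theta)\right) dh(x).$$

The collection $(P_\theta : \theta \in \Theta)$ is called a single-parameter exponential family. 
An exponential family is regular if $\Theta$ is non-empty and open. It is non-singular 
if $A''(\theta) > 0$ for all $\theta \in \Theta$. 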


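EXAMPLE 34.3. Let $\sigma^2 > 0$ and $h = \mathcal N(0, \sigma^2)$ and $\eta(\theta) = \theta/\sigma^2$ and $S(x) = x$. An 
easy calculation shows that $A(\theta) = \theta^2/(2\sigma^2)$, which has domain $\Theta = \mathbb R$ and 
$P_\theta = \mathcal N(\theta, \sigma^2)$. 


EXAMPLE 34.4. Let $h = \delta_0 + \delta_1$ be the sum of Dirac measures and $S(x) = x$ 
and $\eta(\theta) = \theta$. Then $A(\theta) = \log(1 + \exp(\theta))$ and $\Theta = \mathbb R$ and $P_\theta = \mathcal B(\sigma(\theta))$, where 
$\sigma(\theta) = \exp(\theta)/(1 + \exp(\theta))$ is the logistic function. 


EXAMPLE 34.5. The same family can be parameterised in many different ways. 
Let $h = \delta_0 + \delta_1$, $S(x) = x$ and $\eta(\theta) = \log(\theta/(1 - \theta))$. Then $A(\theta) = -\log(1 - \theta)$ 
and $\Theta = (0,1)$ and $P_\theta = \mathcal B(\theta)$. 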
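
The two Bernoulli parameterisations are easy to compare in code. In this sketch a family is represented by a pair of functions (the map from parameter to natural parameter, and the log-partition function), and the density is evaluated with respect to the counting measure on {0, 1} with the identity as sufficient statistic. The representation and the names `NATURAL` and `MEAN` are our own conventions.

```python
import math

def density(family, theta, x):
    """exp(eta(theta) * x - A(theta)), where S(x) = x and the base
    measure is the sum of Dirac measures at 0 and 1."""
    eta, log_partition = family
    return math.exp(eta(theta) * x - log_partition(theta))

# Natural parameterisation: eta(theta) = theta, A(theta) = log(1 + exp(theta)).
NATURAL = (lambda th: th, lambda th: math.log1p(math.exp(th)))
# Mean parameterisation: eta(theta) = log(theta / (1 - theta)), A(theta) = -log(1 - theta).
MEAN = (lambda th: math.log(th / (1.0 - th)), lambda th: -math.log1p(-th))

def logistic(th):
    return math.exp(th) / (1.0 + math.exp(th))

# The success probability is logistic(theta) in the natural
# parameterisation and theta itself in the mean parameterisation.
print(density(NATURAL, 0.7, 1), logistic(0.7))
print(density(MEAN, 0.3, 1), density(MEAN, 0.3, 0))
```

In both cases the densities at 0 and 1 sum to one, which is exactly what subtracting the log-partition function achieves.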


Exponential families have many nice properties, some of which you will prove 
in Exercise 34.5. Of most interest to us here is the existence of conjugate priors. 
Suppose that $(P_\theta : \theta \in \Theta)$ is a single-parameter exponential family determined by 
$h$, $\eta$ and $S$, where $S(x) = x$ is the identity map. Let $x_0, n_0 \in \mathbb R$, and define prior 
measure $Q$ on $(\Theta, \mathfrak B(\Theta))$ in terms of its density $q = dQ/d\lambda$ with $\lambda$ the Lebesgue 
measure: 

$$q(\theta) = \frac{\exp\left(n_0 x_0 \eta(\theta) - n_0 A(\theta)\right)}{\int_\Theta \exp\left(n_0 x_0 \eta(\theta') - n_0 A(\theta')\right) d\theta'}, \tag{34.4}$$

where we assume that the integral in the denominator exists and is positive. 
Suppose we observe $X = x$. Then a choice of posterior has density with respect 
to the Lebesgue measure given by 

$$q(\theta \mid x) = \frac{\exp\left(\eta(\theta)(x + n_0 x_0) - (1 + n_0) A(\theta)\right)}{\int_\Theta \exp\left(\eta(\theta')(x + n_0 x_0) - (1 + n_0) A(\theta')\right) d\lambda(\theta')}.$$

What this means is that after observing the value $x$, the posterior takes the form 
of the prior except that the parameters $(x_0, n_0)$ associated with the prior get 
updated to $((n_0 x_0 + x)/(n_0 + 1), n_0 + 1)$. The posterior is both easy to represent 
and maintain. To see how exponential families recover previous examples, consider 
the Bernoulli case of Example 34.5. Since 

$$\exp\left(n_0 x_0 \eta(\theta) - n_0 A(\theta)\right) = \left(\frac{\theta}{1 - \theta}\right)^{n_0 x_0} (1 - \theta)^{n_0} = \theta^{n_0 x_0} (1 - \theta)^{n_0 (1 - x_0)},$$

we see that the prior from (34.4) is a beta distribution with parameters 
$\alpha = 1 + n_0 x_0$ and $\beta = 1 + n_0(1 - x_0)$, as can be seen from (34.3). As expected, 
the posterior update also works as described earlier. 
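
The claim that the two descriptions of the update agree can be verified with exact arithmetic. The sketch below updates the pair of prior parameters by the rule for exponential families, converts to beta parameters via the correspondence above, and compares with the direct beta update that adds x to the first parameter and 1 - x to the second. Function names and the particular numbers are illustrative.

```python
from fractions import Fraction

def update_prior_parameters(x0, n0, x):
    """Exponential-family conjugate update after observing x."""
    return (n0 * x0 + x) / (n0 + 1), n0 + 1

def to_beta_parameters(x0, n0):
    """Beta parameters matching the prior in the Bernoulli case."""
    return 1 + n0 * x0, 1 + n0 * (1 - x0)

x0, n0 = Fraction(1, 4), Fraction(8)
alpha, beta = to_beta_parameters(x0, n0)

for x in (0, 1):
    via_family = to_beta_parameters(*update_prior_parameters(x0, n0, x))
    direct = (alpha + x, beta + 1 - x)
    print(x, via_family, direct)
```

One way to read the parameters: the second behaves like a count of pseudo-observations and the first like their average, which is why a new observation enters as a running mean.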


There are important parametric families with conjugate priors that are not 
exponential families. One example is the uniform family (U (a,b) : a < b), 
which is conjugate to the Pareto family. 


Sequences of Random Variables and the Markov Chain View 


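Let $P^n = (P_\theta^n : \theta \in \Theta)$ be a probability kernel from $(\Theta, \mathcal G)$ to $(\mathcal X^n, \mathcal H^{\otimes n})$ and 
$Q$ a prior on $(\Theta, \mathcal G)$. Then, let $\theta$ and $X_1, \ldots, X_n$ be random elements on some 
probability space $(\Omega, \mathcal F, \mathbb P)$, where $\theta \in \Theta$ and $X_t \in \mathcal X$ such that 


(a) the law of $\theta$ is $\mathbb P_\theta = Q$; and 
(b) $\mathbb P((X_1, \ldots, X_n) \in B \mid \theta) = P_\theta^n(B)$ almost surely for all $B \in \mathcal H^{\otimes n}$. 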


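By definition, the posterior after observing $(X_s)_{s=1}^t$ is a probability kernel $Q_t$ 
from $(\mathcal X^t, \mathcal H^{\otimes t})$ to $(\Theta, \mathcal G)$ such that for any $B \in \mathcal G$, 

$$\mathbb E[\mathbb I\{\theta \in B\} \mid X_1, \ldots, X_t] = Q_t(B \mid X_1, \ldots, X_t) \quad \text{almost surely}.$$

Then, by the tower rule, the conditional distribution of $X_{t+1}$ given $(X_s)_{s=1}^t$ 
almost surely satisfies 

$$\mathbb P(X_{t+1} \in B \mid X_1, \ldots, X_t) = \int_\Theta P_\theta^n(X_{t+1} \in B \mid X_1, \ldots, X_t)\, Q_t(d\theta \mid X_1, \ldots, X_t). \tag{34.5}$$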


This identity says that the conditional distribution of $X_{t+1}$ can be written in terms 
of the model and posterior. In the fundamental setting where $P_\theta^n = P_\theta \otimes \cdots \otimes P_\theta$ 
is a product probability measure, Eq. (34.5) reduces to 

$$\mathbb P(X_{t+1} \in B \mid X_1, \ldots, X_t) = \int_\Theta P_\theta(B)\, Q_t(d\theta \mid X_1, \ldots, X_t),$$

which shows that in this case the posterior summarises all the useful information 
in $(X_s)_{s=1}^t$ for predicting future data. By introducing a little measure-theoretic 
machinery and making suitable regularity assumptions, it is possible to show that 
the sequence $Q_1, \ldots, Q_n$ is a time-inhomogeneous Markov chain. In many cases, 
the posterior has a simple form, as you can see in the next two examples. 


EXAMPLE 34.6. Suppose $\Theta = [0,1]$ and $\mathcal G = \mathfrak B([0,1])$ and $Q = \mathrm{Beta}(\alpha, \beta)$ 
and $P_\theta = \mathcal B(\theta)$ is Bernoulli. Then the posterior after $t$ observations is $Q_t = 
\mathrm{Beta}(\alpha + S_t, \beta + t - S_t)$, where $S_t = \sum_{s=1}^t X_s$. Furthermore, $\mathbb E[X_{t+1} \mid X_1, \ldots, X_t] = 
\mathbb E_{Q_t}[X_{t+1}] = (\alpha + S_t)/(\alpha + \beta + t)$, and hence 

$$\mathbb P(S_{t+1} = S_t + 1 \mid S_t) = \frac{\alpha + S_t}{\alpha + \beta + t}, \qquad \mathbb P(S_{t+1} = S_t \mid S_t) = \frac{\beta + t - S_t}{\alpha + \beta + t}.$$

So the posterior after $t$ observations is a Beta distribution depending on $S_t$ and 
$S_1, S_2, \ldots, S_n$ follows a Markov chain evolving according to the above display. 
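
Because the chain of success counts has explicit transition probabilities, its marginal distribution after n rounds can be computed by a short forward recursion. The sketch below does so exactly, assuming integer beta parameters so that rational arithmetic applies; the function name is our own. With a uniform prior (both parameters equal to one) the recursion returns the uniform distribution on {0, ..., n}, a classical property of the Pólya urn.

```python
from fractions import Fraction

def success_count_distribution(alpha, beta, n):
    """Distribution of S_n for the Beta/Bernoulli Markov chain.

    alpha, beta are assumed to be positive integers. Returns a dict
    mapping each value s to P(S_n = s).
    """
    dist = {0: Fraction(1)}  # S_0 = 0 with probability one
    for t in range(n):
        nxt = {}
        for s, prob in dist.items():
            # Posterior mean after t rounds with s successes.
            p_up = Fraction(alpha + s, alpha + beta + t)
            nxt[s + 1] = nxt.get(s + 1, 0) + prob * p_up
            nxt[s] = nxt.get(s, 0) + prob * (1 - p_up)
        dist = nxt
    return dist

uniform_prior = success_count_distribution(1, 1, 5)
print(uniform_prior)
```

Note that the transition probability depends on both the state and the round index, which is what makes the chain time-inhomogeneous.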


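EXAMPLE 34.7. Let $(\Theta, \mathcal G) = (\mathbb R, \mathfrak B(\mathbb R))$ and $Q = \mathcal N(\mu, \sigma^2)$ and $P_\theta = \mathcal N(\theta, 1)$. 
Then, using the same notation as above the posterior is almost surely $Q_t = 
\mathcal N(\mu_t, \sigma_t^2)$, where 

$$\mu_t = \frac{\mu/\sigma^2 + S_t}{1/\sigma^2 + t} \quad \text{and} \quad \sigma_t^2 = \left(\frac{1}{\sigma^2} + t\right)^{-1}.$$

Then $S_1, S_2, \ldots, S_n$ is a Markov chain with the conditional distribution of $S_{t+1}$ 
given $S_t$ a Gaussian with mean $S_t + \mu_t$ and variance $1 + \sigma_t^2$. 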
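
A quick sanity check on this example is that updating one observation at a time gives the same posterior as the closed form that depends on the data only through the running sum. The sketch below compares the two for unit-variance observations; the function names and the sample data are illustrative.

```python
def posterior_after(mu, var, observations):
    """Closed-form posterior (mean, variance) for a Gaussian prior with
    mean mu and variance var, after unit-variance Gaussian observations."""
    t = len(observations)
    s_t = sum(observations)
    precision = 1.0 / var + t
    return (mu / var + s_t) / precision, 1.0 / precision

def one_step(mean, var, x):
    """Update the current posterior with a single new observation."""
    precision = 1.0 / var + 1.0
    return (mean / var + x) / precision, 1.0 / precision

def predictive_for_next_sum(mu, var, observations):
    """Mean and variance of the next running sum given the data so far."""
    mu_t, var_t = posterior_after(mu, var, observations)
    return sum(observations) + mu_t, 1.0 + var_t

xs = [0.5, 2.0, -1.0]
state = (1.0, 4.0)  # prior mean and prior variance
for x in xs:
    state = one_step(*state, x)
batch = posterior_after(1.0, 4.0, xs)
print(state, batch)
print(predictive_for_next_sum(1.0, 4.0, xs))
```

The agreement of `state` and `batch` reflects the fact that the posterior after each round serves as the prior for the next.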


The Bayesian Bandit Environment 


The Bayesian bandit model is the same as the frequentist version introduced 
in Chapter 4, except that at the beginning of the game, an environment is 
sampled from the prior. Of course, the chosen environment is not revealed to 
the learner, but its presence forces us to change our conditions on the rewards 
because the rewards are dependent on each other through the chosen environment. 
For simplicity, we treat only the finite, $k$-armed case, but the more general set-up 
is handled in the same way as in Chapter 4. 

A $k$-armed Bayesian bandit environment is a tuple $(\mathcal E, \mathcal G, Q, P)$, where 
$(\mathcal E, \mathcal G)$ is a measurable space and $Q$ is a probability measure on $(\mathcal E, \mathcal G)$ called the 
prior. The last element $P = (P_{\nu i} : \nu \in \mathcal E, i \in [k])$ is a probability kernel from 
$\mathcal E \times [k]$ to $(\mathbb R, \mathfrak B(\mathbb R))$, where $P_{\nu i}$ is the reward distribution associated with the $i$th 
arm in bandit $\nu$. A Bayesian bandit environment and policy $\pi = (\pi_t)_{t=1}^n$ interact 
to produce a collection of random variables, $\nu \in \mathcal E$, $(A_t)_{t=1}^n$ and $(X_t)_{t=1}^n$ with 
$A_t \in [k]$ and $X_t \in \mathbb R$ that satisfy 


(a) $\mathbb P(\nu \in \cdot) = Q(\cdot)$; 
(b) the conditional distribution of action $A_t$ given $\nu, A_1, X_1, \ldots, A_{t-1}, X_{t-1}$ is 
$\pi_t(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$ almost surely; and 

(c) the conditional distribution of the reward $X_t$ given $\nu, A_1, X_1, \ldots, A_t$ is $P_{\nu A_t}$ 
almost surely. 


The existence of a probability space carrying random elements satisfying 
these conditions is guaranteed by the Ionescu-Tulcea theorem (Theorem 3.3, 
Exercise 34.9). The corresponding probability measure will be denoted by $\mathbb P_{QP\pi}$. 


Most of the structure of a Bayesian bandit environment is in $P$, which 
determines the reward distribution for each arm $i$ in bandits $\nu \in \mathcal E$. 


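EXAMPLE 34.8. A $k$-armed Bayesian Bernoulli bandit environment could be 
defined by letting $\mathcal E = [0,1]^k$, $\mathcal G = \mathfrak B(\mathcal E)$ and $P_{\nu i} = \mathcal B(\nu_i)$. A natural prior in this 
case would be a product of $\mathrm{Beta}(\alpha, \beta)$ distributions: 

$$Q(A) = \int_A \prod_{i=1}^k q(x_i)\, dx,$$

where $q(x) = x^{\alpha-1}(1 - x)^{\beta-1}\, \Gamma(\alpha + \beta)/(\Gamma(\alpha)\Gamma(\beta))$. 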


Posterior Distributions in Bandits 


Let $(\mathcal E, \mathcal G, Q, P)$ be a $k$-armed Bayesian bandit environment. Assuming that $(\mathcal E, \mathcal G)$ 
is a Borel space, Theorem 34.1 guarantees the existence of the posterior: a 
probability kernel $Q(\cdot \mid \cdot)$ from the space of histories to $(\mathcal E, \mathcal G)$ so that 

$$Q(A \mid a_1, x_1, \ldots, a_t, x_t)$$

is a regular version of $\mathbb E[\mathbb I_A(\nu) \mid A_1, X_1, \ldots, A_t, X_t]$. For explicit calculations, it is 
worth adding some extra structure: assume there exists a $\sigma$-finite measure $\lambda$ 
on $(\mathbb R, \mathfrak B(\mathbb R))$ such that $P_{\nu i} \ll \lambda$ for all $i \in [k]$ and $\nu \in \mathcal E$. Recall from Chapter 15 
that the Radon–Nikodym derivative of $\mathbb P_{\nu\pi}$ with respect to $(\rho \times \lambda)^n$, where $\rho$ is 
the counting measure on $[k]$, is 

$$p_{\nu\pi}(a_1, x_1, \ldots, a_n, x_n) = \prod_{t=1}^n \pi_t(a_t \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1})\, p_{\nu a_t}(x_t), \tag{34.6}$$

where $p_{\nu a}$ is the density of $P_{\nu a}$ with respect to $\lambda$. Then the posterior after $t$ 
rounds is given by 

$$Q(B \mid a_1, x_1, \ldots, a_t, x_t) = \frac{\int_B p_{\nu\pi}(a_1, x_1, \ldots, a_t, x_t)\, dQ(\nu)}{\int_{\mathcal E} p_{\nu\pi}(a_1, x_1, \ldots, a_t, x_t)\, dQ(\nu)} = \frac{\int_B \prod_{s=1}^t p_{\nu a_s}(x_s)\, dQ(\nu)}{\int_{\mathcal E} \prod_{s=1}^t p_{\nu a_s}(x_s)\, dQ(\nu)}, \tag{34.7}$$

where the second equality follows from Eq. (34.6), because the policy terms do not 
depend on $\nu$ and cancel. The posterior is not 
defined when the denominator is zero, which only occurs with probability zero 
(Exercise 34.11). Note that the Radon–Nikodym derivatives $p_{\nu a}(x)$ are only 
unique up to sets of $P_{\nu a}$-measure zero, and so the ‘choice’ of posterior has been 
converted to a choice of the Radon–Nikodym derivatives, which, in all practical 
situations, is straightforward. Observe also that Eq. (34.7) is only well defined 
if $p_{\nu a}(\cdot)$ is $\mathcal G$-measurable as a function of $\nu$. Fortunately this is always possible 
(see Note 8). 


EXAMPLE 34.9. The posterior for the Bayesian bandit in Example 34.8 in terms 
of its density with respect to the Lebesgue measure is 

$$q(\theta \mid a_1, x_1, \ldots, a_t, x_t) \propto \prod_{i=1}^k \theta_i^{\alpha + s_i(h_t) - 1} (1 - \theta_i)^{\beta + t_i(h_t) - s_i(h_t) - 1},$$

where $s_i(h_t) = \sum_{u=1}^t x_u \mathbb I\{a_u = i\}$ and $t_i(h_t) = \sum_{u=1}^t \mathbb I\{a_u = i\}$. This means the 
posterior is also the product of Beta distributions, each updated according to the 
observations from the relevant arm. 
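
In code, maintaining this posterior amounts to keeping two counters per arm. The following sketch computes the beta parameters of each arm from a history of (action, reward) pairs. It assumes arms are indexed from zero and rewards are 0 or 1; the function name and data layout are our own.

```python
def bandit_beta_posterior(k, alpha, beta, history):
    """Per-arm Beta posterior parameters for a k-armed Bernoulli bandit
    with independent Beta(alpha, beta) priors on the arm means.

    history: iterable of (arm, reward) pairs, arms in range(k),
             rewards in {0, 1}.
    """
    pulls = [0] * k      # number of times each arm was played
    successes = [0] * k  # sum of rewards obtained from each arm
    for arm, reward in history:
        pulls[arm] += 1
        successes[arm] += reward
    return [
        (alpha + successes[i], beta + pulls[i] - successes[i])
        for i in range(k)
    ]

history = [(0, 1), (0, 0), (1, 1)]
print(bandit_beta_posterior(k=3, alpha=1, beta=1, history=history))
```

An arm that has never been played keeps its prior parameters, and the policy that generated the actions plays no role in the update.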


Bayesian Regret 


Recall that the regret of policy $\pi$ in $k$-armed bandit environment $\nu$ over $n$ rounds 
is 

$$R_n(\pi, \nu) = n\mu^* - \mathbb E\left[\sum_{t=1}^n X_t\right], \tag{34.8}$$

where $\mu^* = \max_{i \in [k]} \mu_i$ and $\mu_i$ is the mean of $P_{\nu i}$. Given a $k$-armed Bayesian 
bandit environment $(\mathcal E, \mathcal G, Q, P)$ and a policy $\pi$, the Bayesian regret is 

$$\mathrm{BR}_n(\pi, Q) = \int_{\mathcal E} R_n(\pi, \nu)\, dQ(\nu).$$

The dependence on $\mathcal E$, $\mathcal G$ and $P$ is omitted on the grounds that these are 
always self-evident from the context. The Bayesian optimal regret is $\mathrm{BR}_n^*(Q) = 
\inf_\pi \mathrm{BR}_n(\pi, Q)$, and the optimal (regret-minimizing) policy is 

$$\pi^* = \operatorname{argmin}_\pi \mathrm{BR}_n(\pi, Q). \tag{34.9}$$

Note that the regret-minimising policy is the same as the reward-maximising 
policy $\pi^* = \operatorname{argmax}_\pi \mathbb E_{\mathbb P_{QP\pi}}\left[\sum_{t=1}^n X_t\right]$, which is known as the Bayesian optimal 
policy under prior $Q$. In all generality, there is no guarantee that the (Bayes) 
optimal policy exists, but the non-negativity of the Bayesian regret ensures that 
for any $\varepsilon > 0$, there exists a policy $\pi$ with $\mathrm{BR}_n(\pi, Q) \le \mathrm{BR}_n^*(Q) + \varepsilon$. 

The fact that the expected regret $R_n(\pi, \nu)$ is non-negative for all $\nu$ and $\pi$ 
means that the Bayesian regret is always non-negative. Perhaps less obviously, 
the Bayesian regret of the Bayesian optimal policy can be strictly greater 
than zero (Exercise 34.8). 
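
When the prior is supported on finitely many environments, the integral defining the Bayesian regret is a weighted sum. The sketch below evaluates it for the simplest kind of policy, one that plays the same arm in every round, for which the frequentist regret is n times the gap of that arm. The prior, the arm means and the function names are invented for illustration, and this restricted policy class says nothing about the Bayesian optimal policy.

```python
from fractions import Fraction

def regret_of_fixed_arm(means, arm, n):
    """Frequentist regret over n rounds of always playing `arm`
    in an environment with the given arm means."""
    return n * (max(means) - means[arm])

def bayesian_regret_of_fixed_arm(prior, arm, n):
    """Bayesian regret of always playing `arm` under a finitely
    supported prior, given as (probability, arm means) pairs."""
    return sum(q * regret_of_fixed_arm(means, arm, n) for q, means in prior)

prior = [
    (Fraction(1, 4), (Fraction(1, 2), Fraction(3, 5))),  # arm 1 slightly better
    (Fraction(3, 4), (Fraction(1, 2), Fraction(1, 5))),  # arm 0 clearly better
]
br = [bayesian_regret_of_fixed_arm(prior, arm, n=10) for arm in range(2)]
print(br)
```

Playing arm 0 is the better of the two commitments here: it is suboptimal only in the less likely environment, and only by a small gap.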


Notes 


1 In Chapter 4, we defined the environment class $\mathcal E$ as a set of tuples of probability 
distributions over the reals. In a Bayesian bandit environment $(\mathcal E, \mathcal G, Q, P)$ the 
set $\mathcal E$ is arbitrary and the reward distributions are given by the probability 
kernel $P$. The probability kernel and the change of notation are needed because 
we are now integrating the regret over $\mathcal E$, which may not be measurable without 
additional conditions. 

2 The Bayesian regret of an algorithm is less informative than the frequentist 
regret. By this we mean that a bound on $\mathrm{BR}_n(\pi, Q)$ does not generally imply 
a meaningful bound on $R_n(\pi, \nu)$, while if $R_n(\pi, \nu) \le f(\nu)$ for a measurable 
function $f$, then $\mathrm{BR}_n(\pi, Q) \le \mathbb E[f(\nu)]$. This is not an argument against using 
a Bayesian algorithm but rather an argument for the need to analyse the 
frequentist regret of Bayesian algorithms. 


3 The relationship between admissibility, Bayesian optimality and minimax 
optimality is one of the main topics of statistical decision theory, which intersects 
heavily with game theory. In many classical statistical settings, all Bayesian 
optimal policies are admissible (Exercise 34.14), and all admissible policies 
are either Bayesian optimal for some prior or the limit point of a sequence of 
Bayesian optimal policies (Exercise 34.13). Be warned, however, that there 
are counterexamples. A nice book with many examples is by Berger [1985]. 
In Exercise 34.15, you will prove that all admissible policies for stochastic 
Bernoulli bandits are Bayesian optimal for some prior. 

4 While admissibility and related notions of optimality are helpful in being clear 
about the goals of algorithm design, we must recognise that these concepts are 
too binary for most purposes. One problem with the classic decision theory 
literature is that it puts too much emphasis on these narrow concepts. Who 
would argue that a policy that is dominated, but just barely, is worth nothing? 
Especially since the optimal policy is often intractable. Meaningful ways of 
defining slightly worse usually consider a bigger picture when a policy design 
approach (policy schema) is evaluated across many problem classes. In the 
Bayesian setting, one may for example consider all k-armed stochastic Bayesian 
bandits with (say) bounded rewards and consider policy schema that work no 
matter what the environment is. An example of a policy schema is Thompson 
sampling, since it can be instantiated for any of the environments. One may 
ask whether such a policy schema is near Bayesian (or minimax) optimal across 
all of the considered environments. In fact, most bandit algorithm design is 
better viewed as designing policy schema. 

5 Many algorithms/statistical methods have Bayesian interpretations. One 
example is ridge regression, which we saw in Chapter 20. Using the notation 
of that chapter, the estimator given in Eq. (20.1) is the mean of the Bayesian 
posterior when the model is Gaussian with known variance and the prior on the 
unknown parameter is a Gaussian with zero mean and covariance $I/\lambda$. Another 
example is the exponential weighting algorithm for prediction with expert 
advice. Consider a sequence $y_1, \ldots, y_n \in [d]$ and suppose there is a set $\mathcal M$ of $k$ 
experts making predictions about $y_t$. We write $\mu(\cdot \mid y_1, \ldots, y_{t-1}) \in \mathcal P_{d-1}$ for the 
distribution of $y_t$ predicted by expert $\mu \in \mathcal M$. In each round the learner observes 
the predictions $\mu(\cdot \mid y_1, \ldots, y_{t-1})$ for all experts $\mu \in \mathcal M$ and should make 
a prediction $\xi(\cdot \mid y_1, \ldots, y_{t-1}) \in \mathcal P_{d-1}$. Notice that defining $\mu(y_1, \ldots, y_n) = 
\prod_{t=1}^n \mu(y_t \mid y_1, \ldots, y_{t-1})$ makes $\mu(\cdot)$ into a probability distribution on $[d]^n$. The 
regret compares the learner’s performance relative to the best expert in $\mathcal M$ 
under the logarithmic loss: 

$$R_n = \max_{\mu \in \mathcal M} \log\left(\frac{\prod_{t=1}^n \mu(y_t \mid y_1, \ldots, y_{t-1})}{\prod_{t=1}^n \xi(y_t \mid y_1, \ldots, y_{t-1})}\right) = \max_{\mu \in \mathcal M} \log\left(\frac{\mu(y_1, \ldots, y_n)}{\xi(y_1, \ldots, y_n)}\right).$$

A Bayesian approach to this problem is to assume that $y_1, \ldots, y_n$ is sampled 
from some unknown $\mu \in \mathcal M$ and choose a prior distribution $Q \in \mathcal P(\mathcal M)$ over 
the experts. Then predict to minimise the Bayesian expected loss, which you 
will show in Exercise 34.6 leads to 

$$\xi(y_t \mid y_1, \ldots, y_{t-1}) = \sum_{\mu \in \mathcal M} \underbrace{\mu(y_t \mid y_1, \ldots, y_{t-1})}_{\text{predictive dist. of } \mu}\ \underbrace{\frac{\mu(y_1, \ldots, y_{t-1})\, Q(\mu)}{\sum_{\nu \in \mathcal M} \nu(y_1, \ldots, y_{t-1})\, Q(\nu)}}_{\text{posterior } Q_{t-1}(\mu)}. \tag{34.10}$$

You will also show that when $Q$ is taken to be the uniform distribution, the 
regret is bounded by $R_n \le \log(k)$ for all sequences $y_1, \ldots, y_n$. Simple algebraic 
manipulations show that the posterior is 

$$Q_t(\mu) = \frac{\exp\left(-\sum_{s=1}^t \log\left(\frac{1}{\mu(y_s \mid y_1, \ldots, y_{s-1})}\right)\right)}{\sum_{\nu \in \mathcal M} \exp\left(-\sum_{s=1}^t \log\left(\frac{1}{\nu(y_s \mid y_1, \ldots, y_{s-1})}\right)\right)},$$

which is precisely the exponential weights distribution with learning rate 
$\eta = 1$. The analogy should not be taken too seriously, however. That this 
algorithm controls the regret for all sequences $y_1, \ldots, y_n$ does not hold for more 
general loss functions. For this, the learning rate must be chosen much more 
conservatively. For more on the online learning approach to learning under the 
logarithmic loss, see chapter 9 of the book by Cesa-Bianchi and Lugosi [2006]. 
The Bayesian approach is covered in the book by Hutter [2004]. 
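
The connection between the Bayesian mixture and exponential weights can be observed on a toy instance. In the sketch below each expert predicts the same distribution in every round, a simplifying assumption made only to keep the code short; symbols are indexed from zero and the prior is uniform. The function name and the example data are our own.

```python
import math

def bayes_mixture_regret(experts, ys):
    """Log-loss regret of the Bayesian mixture with a uniform prior.

    experts: list of probability vectors over the symbols (each expert
             predicts the same distribution in every round)
    ys:      observed sequence of symbols, indexed from zero
    """
    k = len(experts)
    weights = [1.0 / k] * k  # uniform prior over the experts
    learner_loss = 0.0
    expert_loss = [0.0] * k
    for y in ys:
        # Mixture prediction: posterior-weighted average of the experts.
        xi = sum(w * e[y] for w, e in zip(weights, experts))
        learner_loss -= math.log(xi)
        for i, e in enumerate(experts):
            expert_loss[i] -= math.log(e[y])
        # Posterior update; equivalently an exponential weights step
        # with learning rate one on the logarithmic loss.
        weights = [w * e[y] / xi for w, e in zip(weights, experts)]
    return learner_loss - min(expert_loss)

experts = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
ys = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]
regret = bayes_mixture_regret(experts, ys)
print(regret, math.log(len(experts)))
```

The regret is at most log(k) because the mixture assigns the whole sequence at least a 1/k fraction of the probability given by the best expert, and it is non-negative because a mixture can never assign more than the best expert does.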

6 Sion’s minimax theorem provides a connection between minimax optimal regret 
and the maximum Bayesian optimal regret over all priors. Let $\Pi$ be the space 
of all policies and $\mathcal P$ be a convex space of probability measures over policies and 
$\mathcal Q$ be a convex space of probability measures on $(\mathcal E, \mathcal G)$. Define $\mathcal L : \mathcal P \times \mathcal Q \to \mathbb R$ 
by 

$$\mathcal L(S, Q) = \int_\Pi \int_{\mathcal E} R_n(\pi, \nu)\, Q(d\nu)\, S(d\pi),$$

which is linear in both arguments because integrals are linear as a function of 
the measure. Suppose that $\mathcal L$ is continuous in both arguments and at least one 
of $\mathcal P$ and $\mathcal Q$ is compact. Then, by Sion’s minimax theorem (Theorem 28.12), 

$$\sup_{Q \in \mathcal Q} \inf_{S \in \mathcal P} \mathcal L(S, Q) = \inf_{S \in \mathcal P} \sup_{Q \in \mathcal Q} \mathcal L(S, Q). \tag{34.11}$$

Usually $\mathcal P$ and $\mathcal Q$ include all Dirac measures: 

$$\{\delta_\pi : \pi \in \Pi_D\} \subseteq \mathcal P \quad \text{and} \quad \{\delta_\nu : \nu \in \mathcal E\} \subseteq \mathcal Q,$$

where $\Pi_D$ is the space of deterministic policies. Then the left-hand side of 
Eq. (34.11) is $\sup_{Q \in \mathcal Q} \mathrm{BR}_n^*(Q)$, and the right-hand side is the minimax regret 
$R_n^*(\mathcal E)$. Choosing $\mathcal P$, $\mathcal Q$ and a measurable structure on $\Pi$ is not always easy. 
Examples may be found in Exercises 34.16 and 36.11. 

7 The issue of conditioning on measure zero sets has been described in many 
places. We do not know of a practical situation where things go awry. Sensible 
choices yield sensible posteriors. The curious reader could probably burn a few 
weeks reading through the literature on the Borel–Kolmogorov paradox 
[Jaynes, 2003, §15.7]. 

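8 Suppose that $(P_\theta : \theta \in \Theta)$ is a probability kernel from $(\Theta, \mathcal G)$ to $(\mathbb R, \mathfrak B(\mathbb R))$ for 
which there exists a measure $\lambda$ on $(\mathbb R, \mathfrak B(\mathbb R))$ such that $P_\theta \ll \lambda$ for all $\theta \in \Theta$. 
Then there exists a family of densities $p_\theta : \mathbb R \to [0,\infty)$ such that $p_\theta(x)$ is jointly 
measurable as a map $(\theta, x) \mapsto p_\theta(x)$ and $p_\theta = dP_\theta/d\lambda$ for all $\theta \in \Theta$. See the 
proof of Lemma 1.2 in Ghosal and van der Vaart [2017] or sections 1.3 and 1.4 
of the book by Strasser [2011]. 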

9 The notion of a sufficient statistic is more general than its role in exponential 
families. Let $X$ and $Y$ be random elements on the same measurable space 
taking values in $\mathcal X$ and $\mathcal Y$ respectively. The random element $Y$ is a sufficient 
statistic for $X$ given a family of distributions $(P_\theta)_{\theta \in \Theta}$ over the probability space 
carrying both $X$ and $Y$ if (i) $Y$ is $\sigma(X)$-measurable and (ii) for all $\theta \in \Theta$, the 
conditional distribution $P_\theta(X \in \cdot \mid Y)$ is independent of the value of $\theta$. Formally, 
(ii) means there exists a probability kernel $P$ from $\mathcal Y$ to $\mathcal X$ such that for any 
$\theta \in \Theta$, $P_\theta(X \in \cdot \mid Y) = P(Y, \cdot)$ holds $P_\theta$-almost surely. Informally, this means 
that given $Y$, there is no information left about $\theta$ in $X$. Denoting by $P_{X,\theta}$ the 
distribution of $X$ under $P_\theta$ and without loss of generality letting $Y = y(X)$ 
for some measurable map $y : \mathcal X \to \mathcal Y$ (recall Lemma 2.5), and assuming that 
$X, Y$ take values in Borel spaces, with the help of the disintegration theorem 
(Theorem 3.12), it is not hard to see that if $(P_{X,\theta})_\theta$ have a common dominating 
$\sigma$-finite measure $\mu$ and for any $\theta \in \Theta$, $\frac{dP_{X,\theta}}{d\mu}(x) = h(x)\, g_\theta(y(x))$ holds $\mu$-almost 
surely for all $x \in \mathcal X$ for some Borel measurable maps $h : \mathcal X \to [0,\infty)$ and 
$g_\theta : \mathcal Y \to [0,\infty)$, then $Y$ is a sufficient statistic for $X$. The Fisher–Neyman 
factorisation theorem states that the converse also holds. With some creative 
matching of concepts, we can see that in single-parameter exponential families, 
what we called a sufficient statistic satisfies the more general definition. 


Bibliographic Remarks 


The original essay by Thomas Bayes is remarkably readable [Bayes, 1763]. There 
are many texts on Bayesian statistics. For an introduction to the applied side, 
there is the book by Gelman et al. [2014]. This book offers lots of discussions 
and examples. A more philosophical book that takes a foundational look at 
probability theory from a Bayesian perspective is by Jaynes [2003]. The careful 
definition of the posterior can be found in several places, but the recent book 
by Ghosal and van der Vaart [2017] does an impeccable job. A worthy mention goes 
to the article by Chang and Pollard [1997], which 
uses disintegration (Theorem 3.12) to formalise the ‘private calculations’ that 
probabilists so frequently make before writing everything carefully using 
Radon–Nikodym derivatives and regular versions. Theorem 34.1 is a specification of the 
theorem guaranteeing the existence of regular conditional probability measures 
(Theorem 3.11). For a detailed presentation of exponential families, see the book 
by Lehmann and Casella [2006]. A compendium of conjugate priors is by Fink 
[1997]. 

[Figure: portrait of Thomas Bayes] 


Exercises 


34.1 (POSTERIOR CALCULATIONS) Evaluate the posteriors for each pair of 
conjugate priors in Section 34.3. 


34.2 (UNIQUENESS OF BETA / BERNOULLI POSTERIOR) Explain why the posterior 
for the Bernoulli model with a beta prior is unique. 


34.3 Use the tower rule to prove the identity in Eq. (34.5). 


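34.4 (POSTERIOR IN TERMS OF DENSITY) Let $P = (P_\theta : \theta \in \Theta)$ be a probability 
kernel from $(\Theta, \mathcal G)$ to $(\mathcal X, \mathcal H)$ and $Q$ be a probability measure on $\Theta$ and $\mathbb P = Q \otimes P$ 
on $\Theta \times \mathcal X$. As usual, let $\theta$ and $X$ be the coordinate projections on $\Theta \times \mathcal X$. Let 
$\nu$ and $\mu$ be probability measures on $(\Theta, \mathcal G)$ and $(\mathcal X, \mathcal H)$ such that $Q \ll \nu$ and 
$P_\theta \ll \mu$ for all $\theta \in \Theta$, and define 

$$q(\theta \mid x) = \frac{p_\theta(x)\, q(\theta)}{\int_\Theta p_\psi(x)\, q(\psi)\, d\nu(\psi)},$$

where $p_\theta(x) = dP_\theta/d\mu$ and $q(\theta) = dQ/d\nu$. You may assume that $p_\theta(x)$ is jointly 
measurable in $\theta$ and $x$ (see Note 8). 


(a) Let $N = \{x : \int_\Theta p_\psi(x)\, q(\psi)\, d\nu(\psi) = 0\}$ and show that $\mathbb P_X(N) = 0$. 

(b) Define $Q(A \mid x) = \int_A q(\theta \mid x)\, d\nu(\theta)$ for $x \notin N$ and let $Q(\cdot \mid x)$ be an arbitrary 
fixed probability measure for $x \in N$. Show that $Q(\cdot \mid X)$ is a regular version 
of $\mathbb P(\theta \in \cdot \mid X)$. 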


Hint The ‘sections’ lemma may prove useful (Lemma 1.26 in Kallenberg 2002), 
along with the properties of the Radon–Nikodym derivative.


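34.5 (EXPONENTIAL FAMILIES) Let $A$, $T$, $h$, $\eta$ and $\Theta$ be as in Section 34.3.1.


(a) Prove that $P_\theta$ is indeed a probability measure.
(b) Let $\mathbb{E}_\theta$ denote expectations with respect to $P_\theta$. Show that $A'(\theta) = \mathbb{E}_\theta[T]$.
(c) Let $\theta \in \Theta$ and $X \sim P_\theta$. Show that for all $\lambda$ with $\lambda + \theta \in \Theta$,

\[ \mathbb{E}_\theta[\exp(\lambda T(X))] = \exp(A(\lambda + \theta) - A(\theta)). \]

(d) Given $\theta, \theta' \in \Theta$, show that

\[ d(\theta, \theta') = \mathbb{E}_\theta\left[\log\left(\frac{p_\theta(X)}{p_{\theta'}(X)}\right)\right] = A(\theta') - A(\theta) - (\theta' - \theta) A'(\theta). \tag{34.12} \]

(e) Let $\theta, \theta' \in \Theta$ be such that $A'(\theta') > A'(\theta)$ and $X_1, \ldots, X_n$ be independent
and identically distributed and $\hat{T} = \frac{1}{n} \sum_{t=1}^n T(X_t)$. Show that

\[ \mathbb{P}\left(\hat{T} \ge A'(\theta')\right) \le \exp\left(-n\, d(\theta', \theta)\right). \]


Curiously, the function $d$ of Eq. (34.12) is both the relative entropy $D(P_\theta, P_{\theta'})$
and the Bregman divergence between $\theta'$ and $\theta$ induced by the convex function
$A$. See Section 26.3 for the definition of Bregman divergence.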


34.6 (EXPONENTIAL WEIGHTS ALGORITHM) Consider the setting of Note 5.


(a) Prove the claim in Eq. (34.10).
(b) Prove that when Q(u) = 1/k is the uniform prior, then the regret is bounded
by $R_n \le \log(k)$ for any $y_1, \ldots, y_n$.


34.7 (MEASURABILITY OF THE REGRET) Let $(\mathcal{E}, \mathcal{G}, Q, P)$ be a Bayesian bandit
environment and $\pi$ a policy. Prove that $R_n(\pi, \nu)$, defined in Eq. (34.8), is
$\mathcal{G}$-measurable as a function of $\nu$.


34.8 (BAYESIAN OPTIMAL REGRET CAN BE POSITIVE) Construct an example
demonstrating that for some priors over finite-armed stochastic bandits, the
Bayesian regret is strictly positive: $\inf_\pi \mathrm{BR}_n(\pi, Q) > 0$.


HINT The key is to observe that under appropriate conditions, $\mathrm{BR}_n(\pi, Q) = 0$
would mean that $\pi$ needs to know the identity of the optimal action under $\nu$ from
round one, which is impossible when $\nu$ is random and the model is rich enough.


34.9 (CANONICAL MODEL) Prove the existence of a probability space carrying 
the random variables satisfying the conditions in Section 34.4. 


34.10 (SUFFICIENCY OF DETERMINISTIC POLICIES) Let $\Pi_D$ be the set of all
deterministic policies and $\Pi$ the space of all policies. Prove that for any k-armed
Bayesian bandit environment $(\mathcal{E}, \mathcal{G}, Q, P)$,

\[ \inf_{\pi \in \Pi} \mathrm{BR}_n(\pi, Q) = \inf_{\pi \in \Pi_D} \mathrm{BR}_n(\pi, Q). \]


34.11 Prove that the denominator in Eq. (34.7) is almost surely non-zero. 


34.12 (BAYESIAN OPTIMAL POLICIES CAN BE DOMINATED) Consider the set-up
in Example 34.2. A Bayesian learner observes $X \sim P_\theta$ and should choose an
action $A_t \in [0, 1]$ that is $\sigma(X)$-measurable. Their loss is $\mathbb{I}\{A_t \ne \theta\}$.


(a) Show that the optimal choice is $A_t = X_t$.

(b) Give a Bayesian optimal algorithm with $A_t \ne X_t$ on some non-empty
(measure zero) event.

(c) Give a Bayesian optimal algorithm and $\theta$ such that the loss when $\theta$ is true
(and so $X \sim P_\theta$) is not zero.


34.13 (ADMISSIBLE POLICIES ARE BAYESIAN FOR FINITE ENVIRONMENTS) Let
$\mathcal{E} = \{\nu_1, \ldots, \nu_N\}$ and $\Pi$ be sets. Call the elements of $\mathcal{E}$ environments, and the
elements of $\Pi$ policies (this is just to help to make the connection to the rest of the
material). Let $\ell : \Pi \times \mathcal{E} \to [0, \infty)$ be a positive loss function. Given a policy $\pi$, let
$\ell(\pi) = (\ell(\pi, \nu_1), \ldots, \ell(\pi, \nu_N))$ be the loss vector resulting from policy $\pi$. Define
$S = \{\ell(\pi) : \pi \in \Pi\} \subset \mathbb{R}^N$ and

\[ \lambda(S) = \{x \in \operatorname{cl}(S) : y \not< x \text{ for all } y \in S\}, \]

where $y \not< x$ is defined to mean it is not true that $y_i \le x_i$ for all $i$ with strict
inequality for at least one $i$ ($\lambda(S)$ is the Pareto frontier of set $S$, and its elements
are the non-dominated loss-outcome vectors in $\operatorname{cl}(S)$). Prove that if $\lambda(S) \subseteq S$
and $S$ is convex, then for every $\pi^* \in \Pi$ such that $\ell(\pi^*) \in \lambda(S)$, there exists a
prior $q \in \mathcal{P}(\mathcal{E})$ such that

\[ \sum_{\nu \in \mathcal{E}} q(\nu)\, \ell(\pi^*, \nu) = \min_{\pi \in \Pi} \sum_{\nu \in \mathcal{E}} q(\nu)\, \ell(\pi, \nu). \]


HINT Use the supporting hyperplane theorem, stated in the hint after
Exercise 26.2.

By identifying elements of $\mathcal{E}$ as ‘criteria’, the interpretation of the result of the
exercise in multi-criteria optimisation is that for non-empty, convex, closed
loss sets, solutions on the Pareto frontier (policies $\pi$ such that $\ell(\pi) \in \lambda(S)$)
can be obtained by minimising a convex combination of the individual
criteria. There is also a connection to constrained optimisation where the
constraints are expressed as bounds on linear combinations of the losses.


34.14 (UNIQUELY BAYES OPTIMAL POLICIES ARE ADMISSIBLE) Let $(\mathcal{E}, \mathcal{G})$ be a
measurable space and $\Pi$ an arbitrary set, the elements of which we call policies. Let
$\ell : \Pi \times \mathcal{E} \to \mathbb{R}$ be a function with $\ell(\pi, \cdot)$ being $\mathcal{G}$-measurable. Given a probability
measure $Q$ on $(\mathcal{E}, \mathcal{G})$, a policy $\pi$ is called Bayesian optimal with respect to $Q$ if

\[ \int_{\mathcal{E}} \ell(\pi, \nu)\, dQ(\nu) = \inf_{\pi' \in \Pi} \int_{\mathcal{E}} \ell(\pi', \nu)\, dQ(\nu). \]

Prove the following:


(a) If $\pi$ is the unique Bayesian optimal policy given prior $Q$, then $\pi$ is admissible.

(b) There is an example when $\pi$ is a Bayesian optimal policy and $\pi$ is inadmissible.

(c) If $\mathcal{E}$ is countable and $\operatorname{Supp}(Q) = \mathcal{E}$, then any Bayes optimal policy $\pi$ is
admissible.

(d) If $\pi$ is Bayesian optimal with respect to prior $Q$, then it is admissible on
$\operatorname{Supp}(Q) \subset \mathcal{E}$.


34.15 (ADMISSIBLE POLICIES ARE BAYESIAN FOR BERNOULLI BANDITS) Let E 
be the set of k-armed Bernoulli bandits. Prove that every admissible policy is 
Bayesian optimal for some prior. 


HINT Argue that all policies can be written as convex combinations of
deterministic policies using an appropriate linear structure. Then identify the
spaces of environments and policies with compact metric spaces. Let $(\nu_j)_{j=1}^\infty$ be
a dense subset of $\mathcal{E}$ and repeat the argument in the previous exercise with each
finite subset $\{\nu_1, \ldots, \nu_j\}$, and then take the limit as $j \to \infty$. You will probably
find Theorem 2.14 useful.


34.16 Let $\mathcal{E} = \mathcal{E}^k_{\mathcal{B}}$ be the space of k-armed stochastic Bernoulli bandits. Endow
$\mathcal{E}$ with a topology via the natural bijection to $[0, 1]^k$, and let $\mathcal{Q}$ be the space of
all probability measures on $(\mathcal{E}, \mathfrak{B}(\mathcal{E}))$ with the weak* topology. Prove that

\[ \max_{Q \in \mathcal{Q}} \mathrm{BR}^*_n(Q) = R^*_n(\mathcal{E}). \]


HINT Use Theorem 2.14 and Sion’s theorem (Theorem 28.12).


Bayesian Bandits 


The first section of this chapter provides simple bounds on the Bayesian optimal 
regret, which are obtained by integrating the regret guarantees for frequentist 
algorithms studied in Part II. This is followed by a short interlude on the basic 
theory of optimal stopping, which we will need later. The next few sections 
are devoted to special cases where computing the Bayesian optimal policy is 
tractable. We start with the finite horizon Bayesian one-armed bandit problem 
where the existence of a tractable solution is reduced to the computation of a 
sequence of functions on the sufficient statistics of the arm with the unknown 
pay-off. Next, the k-armed setting is considered. The main question is whether
there exists a solution that avoids considering joint sufficient statistics over all
arms, which would be intractable in the absence of further structure (see Note 2).
Avoiding the joint sufficient statistics is not possible in general. A remarkable
exception is the problem of maximising the total expected discounted reward over an
infinite horizon, where John C. Gittins’s celebrated result shows that the Bayesian
optimal policy takes the form of an ‘index’ policy. Such a policy keeps statistics
for each arm separately (updated based on that arm’s observations only), uses them to
compute a value (‘index’) for each arm and, in each round, chooses the arm with the
highest index.


Bayesian Optimal Regret for k-Armed Stochastic Bandits 


Even in relatively benign set-ups, the computation of the Bayesian optimal policy 
appears hopelessly intractable. Nevertheless, one can investigate the value of the 
Bayesian optimal regret by proving upper and lower bounds. 

For simplicity, we restrict our attention to Bernoulli bandits, but the arguments
generalise to other models. Let $(\mathcal{E}, \mathcal{G}) = ([0, 1]^k, \mathfrak{B}([0, 1]^k))$, and for $\nu \in [0, 1]^k$
let $P_{\nu i} = \mathcal{B}(\nu_i)$. Choose some prior $Q$ on $(\mathcal{E}, \mathcal{G})$. The Bayesian optimal regret is
necessarily no larger than the minimax regret, which by Theorem 9.1 means that

\[ \mathrm{BR}^*_n(Q) \le C\sqrt{kn}, \]

where $C > 0$ is a universal constant. The proof of the lower bound in Exercise 15.2
shows that for each $n$, there exists a prior $Q$ for which

\[ \mathrm{BR}^*_n(Q) \ge c\sqrt{kn}, \]

where $c > 0$ is a universal constant. Together, these two bounds show that
$\sup_Q \mathrm{BR}^*_n(Q) = \Theta(\sqrt{kn})$.

Turning to the asymptotics for a fixed distribution, recall that for any fixed
Bernoulli bandit environment, the asymptotic growth rate of regret is $O(\log(n))$.
In stark contrast to this, the best we can say in the Bayesian case is that the
asymptotic growth rate of $\mathrm{BR}^*_n(Q)$ is slower than $\sqrt{n}$, but for some priors, $\sqrt{n}$ is
almost a lower bound on the growth rate. In particular, we ask you to prove the
following theorem in Exercise 35.1:


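THEOREM 35.1. For any prior $Q$,

\[ \limsup_{n \to \infty} \frac{\mathrm{BR}^*_n(Q)}{n^{1/2}} = 0. \]

Furthermore, there exists a prior $Q$ such that for all $\varepsilon > 0$,

\[ \liminf_{n \to \infty} \frac{\mathrm{BR}^*_n(Q)}{n^{1/2 - \varepsilon}} = \infty. \]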

The lower bound has a worst-case flavour in the sense that it holds for a specific prior.
The prior that yields the lower bound is a little unnatural because it assigns the
overwhelming majority of its mass to bandits with small suboptimality gaps. In
particular, $Q(\{\nu \in \mathcal{E} : \Delta_{\min}(\nu) \le \Delta\}) \ge c/\log(1/\Delta)$ for some constant $c > 0$. For
more regular priors, the Bayesian optimal regret satisfies $\mathrm{BR}^*_n(Q) = O(\log^2(n))$.
See the bibliographic remarks for pointers to the literature.


Optimal Stopping (-®) 


We now make a detour to show some results of optimal stopping, which will be 
used in the next sections to find tractable solutions to certain Bayesian bandit 
problems. 

The first setting we consider will be useful for the one-armed bandit problem.
Let $(U_t)_{t=1}^n$ be a sequence of random variables adapted to filtration $\mathbb{F} = (\mathcal{F}_t)_{t=1}^n$.
Optimal stopping is concerned with finding solutions to optimisation problems of
the following form:

\[ \sup_{\tau \in \mathfrak{R}_1^n} \mathbb{E}[U_\tau], \tag{35.1} \]

where $\mathfrak{R}_1^n$ is the set of $\mathbb{F}$-stopping times $\tau$ with $1 \le \tau \le n$. When $n$ is finite,
the situation is conceptually straightforward. The idea is to use backwards
induction to define the expected optimal utility conditioned on the information
in $\mathcal{F}_t$ starting from $t = n$ and working backwards to $t = 1$. The Snell envelope
is a sequence of random variables $(E_t)_{t=1}^n$ defined by

\[ E_t = \begin{cases} U_n, & \text{if } t = n; \\ \max\{U_t, \mathbb{E}[E_{t+1} \mid \mathcal{F}_t]\}, & \text{otherwise}. \end{cases} \]
Intuitively, $E_t$ is the optimal expected value one can guarantee provided that
stage $t$ was reached.


THEOREM 35.2. Assume that $n$ is finite and $U_t$ is integrable for all $t \in [n]$. Then
the stopping time $\tau = \min\{t \in [n] : U_t = E_t\} \in \mathfrak{R}_1^n$ achieves the supremum in
Eq. (35.1).


Backwards induction is not directly applicable when the horizon is infinite. 
There are several standard ways around this problem. For our purposes, the most 
convenient workaround is to introduce a Markov structure. The connection to 
the Bayesian bandit setting is that in the Bayesian setting, posteriors follow a 
Markov process. The connection will be made explicit in a few examples in later 
sections. 

Let $(S, \mathcal{G})$ be a Borel space and $(P_x : x \in S)$ be a probability kernel from $S$ to
itself and $u : S \to \mathbb{R}$ be $\mathcal{G}/\mathfrak{B}(\mathbb{R})$-measurable. A Markov reward process is a
Markov chain $(S_t)_{t=1}^\infty$ evolving according to $P$ and a sequence of random variables
$(U_t)_{t=1}^\infty$ with $U_t = u(S_t)$. Define the filtration $\mathbb{F} = (\mathcal{F}_t)_{t=1}^\infty$ with $\mathcal{F}_t = \sigma(S_1, \ldots, S_t)$.
The (Markov) optimal stopping problem is

\[ \sup_{\tau \in \mathfrak{R}_1} \mathbb{E}[U_\tau], \]

where $\mathfrak{R}_1$ is the set of $\mathbb{F}$-adapted stopping times, and the initial distribution of
$S_1$ is arbitrary. Inspired by the solution of the finite horizon problem, define the
value function $v : S \to \mathbb{R}$ by

\[ v(x) = \sup_{\tau \in \mathfrak{R}_1} \mathbb{E}_x[U_\tau], \tag{35.2} \]

where $\mathbb{P}_x$ is the probability measure on the space carrying $(S_t)_{t=1}^\infty$ for which
$\mathbb{P}_x(S_1 = x) = 1$ and $\mathbb{E}_x$ is the expectation with respect to $\mathbb{P}_x$. As before, the
idea is to stop when $U_t$ is above $\int_S v(y)\, P_{S_t}(dy)$, the predicted optimal value of
continuing. Note that ties can be resolved in any way (depending on $S_t$, one may
or may not stop when the predicted optimal value of continuation is equal to $U_t$).
The next result gives sufficient conditions under which stopping rules of this form
are indeed optimal.


THEOREM 35.3. Assume for all $x \in S$ that $U_\infty = \lim_{n \to \infty} U_n$ exists $\mathbb{P}_x$-a.s. and
$\sup_{n \ge 1} |U_n|$ is $\mathbb{P}_x$-integrable. Then $v$ satisfies the Wald–Bellman equation,

\[ v(x) = \max\left\{u(x), \int_S v(y)\, P_x(dy)\right\} \quad \text{for all } x \in S. \]

Furthermore, $\lim_{n \to \infty} v(S_n) = U_\infty$ $\mathbb{P}_x$-a.s., and the supremum in Eq. (35.2) is
achieved by any stopping time $\tau$ such that for all $t$,


(a) $\tau \le t$ on the event that $U_t > \int_S v(y)\, P_{S_t}(dy)$; and
(b) $\tau > t$ on the event that $U_t < \int_S v(y)\, P_{S_t}(dy)$ and $\tau \ge t$.


The conditions are satisfied in many practical applications, e.g. if the Markov
chain is ergodic and the utility function is bounded over the state space. In
our application, $U_n$ will be an accumulation of discounted rewards, which in all
standard situations converges very fast.


A natural choice of stopping time satisfying conditions (a) and (b) in
Theorem 35.3 is $\tau = \min\{t \ge 1 : v(S_t) = U_t\}$. The conditions express that
in the indifference region $\{x \in S : u(x) = \int_S v(y)\, P_x(dy)\}$, both stopping
and continuing are acceptable.


The proof of Theorem 35.2 is straightforward (Exercise 35.2). Measurability 
issues make the proof of Theorem 35.3 more technical (Exercise 35.3). Pointers 
to the literature are given in the notes, and a solution to the exercise is available. 
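
To make the Wald–Bellman equation concrete, here is a minimal numerical sketch for a finite state space. The chain, the pay-off vector `u` and the function name `optimal_stopping_value` are hypothetical choices made for illustration. The sketch also assumes that repeatedly applying $v \leftarrow \max(u, Pv)$ starting from $v = u$ converges to the value function, which is reasonable for the bounded pay-offs and absorbing random walk used here but is not something Theorem 35.3 itself claims.

```python
import numpy as np

def optimal_stopping_value(P, u, tol=1e-12, max_iter=100_000):
    """Iterate v <- max(u, P v) from v = u until the update is below tol."""
    v = u.astype(float).copy()
    for _ in range(max_iter):
        v_new = np.maximum(u, P @ v)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical example: symmetric random walk on {0, ..., 6}, absorbed at both ends.
m = 7
P = np.zeros((m, m))
P[0, 0] = P[m - 1, m - 1] = 1.0
for x in range(1, m - 1):
    P[x, x - 1] = P[x, x + 1] = 0.5
u = np.array([0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 0.0])  # pay-off u(x) for stopping in state x

v = optimal_stopping_value(P, u)
stop = np.isclose(v, u)  # states where stopping immediately is optimal
```

In this example the computed `v` is the smallest concave function lying above `u` with the same boundary values, and `stop` marks the states where $v(x) = u(x)$, which is where the stopping time $\tau = \min\{t \ge 1 : v(S_t) = U_t\}$ stops.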


One-armed bandits 


The one-armed Bayesian bandit problem is a special case where the Bayesian
optimal policy has a simple form that can often be computed efficiently. Before
reading on, you might like to refresh your memory by looking at Exercises 4.11
and 8.2. Let $(\mathcal{E}, \mathcal{G}, Q, P)$ be a two-armed Bayesian bandit environment, where
$P_{\nu 2} = \delta_{\mu_2}$ is a Dirac at fixed constant $\mu_2 \in \mathbb{R}$ for all $\nu \in \mathcal{E}$. Because the mean
of the second arm is known in advance, we call this a one-armed Bayesian
bandit problem. In Part (a) of Exercise 4.11, you showed that when the horizon is
known, retirement policies that choose the first arm until some random
time before switching to the second arm until the end of the game (pointwise
over $\nu$) dominate all other policies in terms of regret. Since we care about
Bayesian optimal policies, the result of Exercise 34.10 allows us to restrict our
attention to deterministic retirement policies.

Figure 35.1 When will you stop playing? A one-armed Bayesian bandit.

These facts allow us to frame the Bayesian one-armed bandit problem in
terms of optimal stopping. Define a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ carrying random
elements $\nu \in \mathcal{E}$ and $Z = (Z_t)_{t=1}^n$ where


(a) the law of $\nu$ is $\mathbb{P}_\nu = Q$; and
(b) $\mathbb{P}(Z \in \cdot \mid \nu) = P_{\nu 1}^n(\cdot)$, which means that after conditioning on $\nu$, the sequence
$(Z_t)_{t=1}^n$ is independent and identically distributed according to $P_{\nu 1}$.


Given a deterministic retirement policy $\pi = (\pi_t)_{t=1}^n$, define the random variable

\[ \tau = \min\{t \ge 1 : \pi_t(2 \mid 1, Z_1, \ldots, 1, Z_{t-1}) = 1\}, \]

where the minimum of an empty set in this case is $n + 1$. Clearly $\tau$ is an $\mathbb{F}$-stopping
time, where $\mathbb{F} = (\mathcal{F}_t)_{t=1}^{n+1}$ with $\mathcal{F}_t = \sigma(Z_1, \ldots, Z_{t-1})$. In fact, this correspondence
between deterministic retirement policies and $\mathbb{F}$-stopping times is a bijection. The
Bayesian expected reward when following the policy associated with stopping
time $\tau$ is

\[ \mathbb{E}\left[\sum_{t=1}^{\tau-1} Z_t + \sum_{t=\tau}^{n} \mu_2\right] = \mathbb{E}[U_\tau], \]

where $U_t = \sum_{s=1}^{t-1} Z_s + (n - t + 1)\mu_2$. Since minimising the Bayesian regret is
equivalent to maximising the Bayesian expected cumulative reward, the problem
of finding the Bayesian optimal policy has been reduced to an optimal stopping
problem.


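PROPOSITION 35.4. If $Z_1$ is integrable, then the Bayesian regret is minimised by
the retirement policy associated with stopping time $\tau = \min\{t \ge 1 : U_t = E_t\}$,
where

\[ E_t = \begin{cases} U_t, & \text{if } t = n + 1; \\ \max\{U_t, \mathbb{E}[E_{t+1} \mid \mathcal{F}_t]\}, & \text{otherwise}. \end{cases} \]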

The interpretation of $E_t$ is that it is the total expected optimal value
conditioned on the information available at the start of round $t$. The proposition
is an immediate corollary of Theorem 35.2 and the fact that integrability of
$Z_1$ is equivalent to integrability of $(U_t)_{t=1}^{n+1}$. The optimal stopping time in
Proposition 35.4 can be rewritten in a more convenient form. For $1 \le t \le n + 1$,
define $W_t = E_t - \sum_{s=1}^{t-1} Z_s$, which can be seen as the optimal value to go for the
last $n - t + 1$ rounds. The definition of $E_t$ shows that $W_{n+1} = 0$ and for $t \le n$,

\[ W_t = \max\left((n - t + 1)\mu_2,\; \mathbb{E}[E_{t+1} \mid \mathcal{F}_t] - \sum_{s=1}^{t-1} Z_s\right)
       = \max\left((n - t + 1)\mu_2,\; \mathbb{E}[Z_t + W_{t+1} \mid \mathcal{F}_t]\right). \tag{35.3} \]

Hence the optimal stopping time can be rewritten as

\[ \tau = \min\{t : U_t = E_t\} = \min\{t : W_t = (n - t + 1)\mu_2\}. \]

This should make intuitive sense. It is optimal to continue only if the expected
future reward from doing so is at least as large as what can be obtained by
stopping immediately. The difficulty is that $\mathbb{E}[Z_t + W_{t+1} \mid \mathcal{F}_t]$ can be quite a
complicated object. We now give two examples where $\mathbb{E}[Z_t + W_{t+1} \mid \mathcal{F}_t]$ has a
simple representation and thus computing the optimal stopping rule becomes
practical. The idea is to find a sequence of sufficient statistics $(S_t)_{t=0}^n$ so that
$S_t \in S$ is $\mathcal{F}_t$-measurable and $P_{\nu 1}(Z_1, \ldots, Z_t \in \cdot \mid S_t)$ is independent of $\nu$. Then
$E_t$ is $\sigma(S_t)$-measurable, and by Lemma 2.5 it follows that $E_t = v_t(S_t)$ for an
appropriately measurable function $v_t : S \to \mathbb{R}$. For more on this, read the next
two subsections, and then do Exercise 35.4.


Bernoulli Rewards


Let $\mathcal{E} = [0, 1]$, $\mathcal{G} = \mathfrak{B}([0, 1])$ and for $\nu \in \mathcal{E}$, let $P_{\nu 1} = \mathcal{B}(\nu)$ and $P_{\nu 2} = \delta_{\mu_2}$: the first
arm is Bernoulli, and the second is a Dirac at some fixed value $\mu_2 \in [0, 1]$. For the
prior, choose $Q = \mathrm{Beta}(\alpha, \beta)$, a Beta prior. By the argument in Example 34.6 the
posterior at the start of round $t$ is a Beta distribution $\mathrm{Beta}(\alpha + S_t, \beta + t - 1 - S_t)$
where $S_t = \sum_{s=1}^{t-1} Z_s$. Letting $p_t(s) = (\alpha + s)/(\alpha + \beta + t - 1)$, it follows that

\[ \mathbb{E}[Z_t \mid \mathcal{F}_t] = \frac{\alpha + S_t}{\alpha + \beta + t - 1} = p_t(S_t), \]
\[ \mathbb{P}(S_{t+1} = S_t + 1 \mid S_t) = p_t(S_t), \]
\[ \mathbb{P}(S_{t+1} = S_t \mid S_t) = 1 - p_t(S_t). \]

Now let $w_{n+1}(s) = 0$ for all $s$ and

\[ w_t(s) = \max\left\{(n - t + 1)\mu_2,\; p_t(s) + p_t(s)\, w_{t+1}(s + 1) + (1 - p_t(s))\, w_{t+1}(s)\right\}. \]

Then $W_t = w_t(S_t)$, and hence the optimal policy can be computed by evaluating
$w_t(s)$ for all $s \in \{0, \ldots, t\}$ starting with $t = n$, then $n - 1$ and so on until $t = 1$.
The total computation for this backwards induction is $O(n^2)$, and the output
is a policy that can be implemented over all $n$ rounds. By contrast, the typical
frequentist stopping rule requires only $O(n)$ computations, so the overhead is
quite severe. The improvement in terms of the Bayes regret is not insignificant,
however, as illustrated by the following experiment.
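
The backwards induction for the Beta/Bernoulli case is short enough to write out. The following is a minimal sketch rather than a reference implementation: the function names `bernoulli_one_armed` and `retirement_round`, the list-of-arrays layout of the table and the argument names `a`, `b` (standing for $\alpha$, $\beta$) are our own illustrative choices.

```python
import numpy as np

def bernoulli_one_armed(n, mu2, a=1.0, b=1.0):
    """Return w with w[t][s] = w_t(s) for t = 1..n+1 and s = 0..t-1."""
    w = [None] * (n + 2)
    w[n + 1] = np.zeros(n + 1)            # w_{n+1}(s) = 0
    for t in range(n, 0, -1):
        s = np.arange(t)                  # S_t counts successes in t-1 rounds
        p = (a + s) / (a + b + t - 1)     # p_t(s), posterior mean of arm 1
        cont = p + p * w[t + 1][s + 1] + (1 - p) * w[t + 1][s]
        w[t] = np.maximum((n - t + 1) * mu2, cont)
    return w

def retirement_round(w, rewards, mu2):
    """First round t with W_t = (n-t+1)*mu2, or n+1 if the learner never retires."""
    n = len(w) - 2
    s = 0
    for t in range(1, n + 1):
        # Exact comparison is safe: np.maximum returns one of its two inputs.
        if w[t][s] == (n - t + 1) * mu2:
            return t
        s += int(rewards[t - 1])
    return n + 1

# Settings of Experiment 35.1: n = 500, mu2 = 1/2 and a Beta(1, 1) prior.
w = bernoulli_one_armed(n=500, mu2=0.5)
plays_first_round = w[1][0] > 500 * 0.5
```

The table has about $n^2/2$ entries, which matches the $O(n^2)$ cost mentioned above. Once it has been computed, deciding whether to retire in round $t$ is a single lookup at the current success count.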


EXPERIMENT 35.1 The horizon is set to $n = 500$ and $\mu_2 = 1/2$. The stopping
rules we compare are the Bayesian optimal policy with a Beta(1,1) prior and the
‘frequentist’ stopping rule given by


T = min fı > 2: fiz < pe and d(Ât-1, H2) > we} : (35.4) 
where $d(p, q)$ is the binary relative entropy and $\hat{\mu}_t = \sum_{s=1}^t X_s / t$ is the empirical
estimate of $\mu_1$ based on the first $t$ observations. Fig. 35.2 shows the expected regret
for different values of $\mu_1$, with horizontal dotted lines indicating the expected regret
averaged over the prior. Note that although the prior is symmetric, the one-armed
bandit problem is not, which explains the asymmetric behaviour of the Bayesian
optimal algorithm. The frequentist algorithm is even more asymmetric, with very
small regret for $\mu_1 > 1/2$ but large regret for $\mu_1 < 1/2$. This is caused by a
conservative confidence interval in Eq. (35.4), which makes it stop consistently
later than its Bayesian counterpart. Stopping later makes it ‘win’ for $\mu_1 > 1/2$, but
it also makes it ‘lose’ when $\mu_1 < 1/2$, with an overall loss (naturally) when
considering the average over all environments.


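Figure 35.2 The plot shows the expected regret for the Bayesian optimal algorithm
compared to the ‘frequentist’ algorithm in Eq. (35.4) on the Bernoulli one-armed bandit
where $\mu_2 = 1/2$ and $\mu_1$ varies on the x-axis. The horizontal lines show the average
regret for each algorithm with respect to the prior, which is uniform.


Gaussian Rewards


Let $(\mathcal{E}, \mathcal{G}) = (\mathbb{R}, \mathfrak{B}(\mathbb{R}))$, where for $\nu \in \mathbb{R}$, we let $P_{\nu 1} = \mathcal{N}(\nu, 1)$ and $P_{\nu 2} = \delta_{\mu_2}$
for $\mu_2 \in \mathbb{R}$ fixed. Choose a Gaussian prior $Q = \mathcal{N}(\mu_P, \sigma_P^2)$ with mean $\mu_P \in \mathbb{R}$
and variance $\sigma_P^2 > 0$. By the results in Section 34.3, the posterior $Q(\cdot \mid x_1, \ldots, x_t)$
after observing rewards $x_1, \ldots, x_t$ from the first arm is almost surely Gaussian
with mean $\mu_t$ and variance $\sigma_t^2$ given by

\[ \mu_t = \frac{\mu_P/\sigma_P^2 + \sum_{s=1}^t x_s}{1/\sigma_P^2 + t}
   \quad \text{and} \quad
   \sigma_t^2 = \left(t + \frac{1}{\sigma_P^2}\right)^{-1}. \tag{35.5} \]

The posterior variance is independent of the observations, so the posterior is
determined entirely by its mean. As in the Bernoulli case, there exist functions
$(w_t)_{t=1}^{n+1}$ such that $W_t = w_t(\mu_{t-1})$ almost surely for all $t \in [n]$. Precisely,
$w_{n+1}(\mu) = 0$ and for $t \le n$,

\[ w_t(\mu) = \max\left((n - t + 1)\mu_2,\; \mu + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{x^2}{2}\right) w_{t+1}(\mu + \sigma_{t-1}\sigma_t\, x)\, dx\right). \tag{35.6} \]

The integral is written here with a standard Gaussian variable $x$: given the first
$t - 1$ rewards, the next posterior mean $\mu_t$ is Gaussian with mean $\mu_{t-1}$ and variance
$\sigma_{t-1}^2 - \sigma_t^2 = \sigma_{t-1}^2 \sigma_t^2$, which follows from Eq. (35.5).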

The integral on the right-hand side does not have a closed-form solution, which 
forces the use of approximate methods. Fortunately w, is a well-behaved function 
and can be efficiently approximated. The favourable properties are summarised 
in the next lemma, the proof of which is left to Exercise 35.5. 


LEMMA 35.5. The following hold:


(a) The function $w_t$ is increasing.
(b) The function $w_t$ is convex.
(c) $\lim_{\mu \to \infty} w_t(\mu)/\mu = n - t + 1$ and $\lim_{\mu \to -\infty} w_t(\mu) = (n - t + 1)\mu_2$.


There are many ways to approximate a function, but in order to propagate
the approximation using Eq. (35.6), it is convenient to choose a form for which
the integral in Eq. (35.6) can be computed analytically. Given the properties in
Lemma 35.5, a natural choice is to approximate $w_t$ using piecewise quadratic
functions. Let $\tilde{w}_{n+1}(\mu) = 0$ and

\[ \bar{w}_t(\mu) = \max\left\{(n - t + 1)\mu_2,\; \mu + \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{x^2}{2}\right) \tilde{w}_{t+1}(\mu + \sigma_{t-1}\sigma_t\, x)\, dx\right\}. \]

Then let $-\infty < x_1 < x_2 < \cdots < x_N < \infty$, and for $\mu \in [x_i, x_{i+1}]$, define
$\tilde{w}_t(\mu) = a_i \mu^2 + b_i \mu + c_i$ to be the unique quadratic approximation of $\bar{w}_t(\mu)$ such
that

\[ \tilde{w}_t(x_i) = \bar{w}_t(x_i), \]
\[ \tilde{w}_t(x_{i+1}) = \bar{w}_t(x_{i+1}), \]
\[ \tilde{w}_t((x_i + x_{i+1})/2) = \bar{w}_t((x_i + x_{i+1})/2). \]

For $\mu < x_1$, we approximate $\tilde{w}_t(\mu) = (n - t + 1)\mu_2$, and for $\mu > x_N$, the linear
approximation $\tilde{w}_t(\mu) = (n - t + 1)\mu$ is reasonable by Lemma 35.5. The computation
time for calculating the coefficients $a_i, b_i, c_i$ for all $t$ and $i \in [N]$ is $O(Nn)$. We
encourage the reader to implement this algorithm and compare it to its natural
frequentist competitors (Exercise 35.11).
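
As a rough numerical companion to the recursion for Gaussian rewards, the sketch below propagates an approximation of $w_t$ backwards on a fixed grid. It deliberately swaps in a simpler technique than the one described above: linear interpolation on the grid together with Gauss–Hermite quadrature, in place of piecewise quadratics with analytic integrals. The function name `gaussian_one_armed`, the grid range and the number of quadrature nodes are illustrative assumptions, and the posterior-mean increment is taken to have standard deviation $\sigma_{t-1}\sigma_t$, as follows from Eq. (35.5).

```python
import numpy as np

def gaussian_one_armed(n, mu2, var_p=1.0, grid=None, n_quad=40):
    """Approximate w_t on a grid of posterior means; returns (grid, {t: w_t(grid)})."""
    if grid is None:
        grid = np.linspace(mu2 - 6.0, mu2 + 6.0, 601)   # assumed range of interest
    # Nodes/weights for the weight function exp(-x^2 / 2), normalised to sum to one.
    x, wts = np.polynomial.hermite_e.hermegauss(n_quad)
    wts = wts / np.sqrt(2 * np.pi)

    def var(t):
        return 1.0 / (t + 1.0 / var_p)                  # sigma_t^2 from Eq. (35.5)

    def evaluate(w_next, t_next, m):
        """Approximate w_{t_next} at points m, with the tails from Lemma 35.5(c)."""
        rem = n - t_next + 1                            # rounds remaining
        out = np.interp(m.ravel(), grid, w_next).reshape(m.shape)
        out = np.where(m < grid[0], rem * mu2, out)
        out = np.where(m > grid[-1], rem * m, out)
        return out

    w = np.zeros_like(grid)                             # w_{n+1} = 0
    table = {n + 1: w}
    for t in range(n, 0, -1):
        step = np.sqrt(var(t - 1) * var(t))             # std of mu_t - mu_{t-1}
        nxt = evaluate(w, t + 1, grid[:, None] + step * x[None, :])
        cont = grid + nxt @ wts                         # mu + E[w_{t+1}(mu_t)]
        w = np.maximum((n - t + 1) * mu2, cont)
        table[t] = w
    return grid, table

grid, table = gaussian_one_armed(n=20, mu2=0.0)
# Continue in round t with posterior mean mu iff w_t(mu) > (n - t + 1) * mu2.
```

Linear interpolation costs $O(N)$ per quadrature node rather than yielding closed-form integrals, so this version is slower than the piecewise quadratic scheme, but it is easy to check against Lemma 35.5: the computed $w_t$ should be increasing, (nearly) convex and bounded below by $(n - t + 1)\max(\mu_2, \mu)$.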


Gittins Index 


Generalising the analysis in the previous section to multiple actions 
is mathematically straightforward, but computationally intractable. The 
computational complexity of backwards induction increases exponentially with 
the number of arms, which is impractical unless the number of arms and horizon 
are both small. 

An index policy is a policy that in each round computes a real-valued index 
for each arm and plays the arm with the largest index, while the index of an 
arm is restricted to depend on statistics collected for that arm only (the time 
horizon can also be used). Many policies we met earlier are index policies. For 
example, most variants of the upper confidence bound algorithm introduced in 
Part II are index policies. Sadly, however, the Bayesian optimal policy for finite 
horizon bandits is not usually an index policy (see Note 6). John C. Gittins 
proved that if one is prepared to modify the objective to a special kind of infinite 
horizon problem, then the Bayesian optimal policy becomes an index policy. In 
the remainder of this chapter, we explore his ideas. 


A Discounted Retirement Game 


We start by describing the discounted setting with one action and then generalise
to multiple actions. Besides discounting, another change is that the
reward-generating process is made into a Markov reward process, a strict generalisation
of the previous case. The motivation is that, as hinted at before, the posterior of
the arm with the unknown payoff evolves as a Markov process.

Let $(S_t)_{t=1}^\infty$ be a Markov chain on Borel space $(S, \mathcal{G})$ evolving according to
probability kernel $(P_x : x \in S)$. As in Section 35.2, let $(\Omega, \mathcal{F}, \mathbb{P}_x)$ be a probability
space carrying $(S_n)_{n=1}^\infty$ with $S_n \in S$ such that


(a) $\mathbb{P}_x(S_1 = x) = 1$; and
(b) $\mathbb{P}_x(S_{n+1} \in \cdot \mid S_n) = P_{S_n}(\cdot)$ with $\mathbb{P}_x$-probability one.


Expectations with respect to $\mathbb{P}_x$ are denoted by $\mathbb{E}_x$. Next, let $\gamma \in \mathbb{R}$ and $r : S \to \mathbb{R}$
be a $\mathcal{G}/\mathfrak{B}(\mathbb{R})$-measurable function, both of which are known to the learner. In
each round $t = 1, 2, \ldots$, the learner observes the state $S_t$ and chooses one of two
options: (a) to retire and end the game or (b) pay the fixed cost $\gamma$ to receive
a reward of $r(S_t)$ and continue for another round. The policy of a learner in
this game corresponds to choosing an $\mathbb{F}$-stopping time $\tau$ with $\mathbb{F} = (\mathcal{F}_t)_{t=1}^\infty$ and
$\mathcal{F}_t = \sigma(S_1, \ldots, S_t)$, where $\tau = t$ means that the learner retires after observing
$S_t$ at the start of round $t$. The $\alpha$-discounted value of the game when starting in
state $S_1 = x$ is

\[ v_\gamma(x) = \sup_{\tau \ge 1} \mathbb{E}_x\left[\sum_{t=1}^{\tau-1} \alpha^{t-1} (r(S_t) - \gamma)\right], \tag{35.7} \]

where $\alpha \in (0, 1)$ is the discount factor. To ensure that this is well defined, we
need the following assumption:


ASSUMPTION 35.6. For all $x \in S$, it holds that $\mathbb{E}_x\left[\sum_{t=1}^\infty \alpha^{t-1} |r(S_t)|\right] < \infty$.


If the rewards are bounded, the assumption will hold. When the rewards are 
unbounded, the assumption restricts the rate of growth of rewards over time. 

The presence of discounting encourages the learner to obtain large rewards 
earlier rather than later and is one distinction between this model and the finite- 
horizon model studied for most of this book. A brief discussion of discounting is 
left for the notes. 

Fix a state $x \in S$. The map $\gamma \mapsto v_\gamma(x)$ is decreasing and is always non-negative.
In fact, if $\gamma$ is large enough, it is easy to see that retiring immediately ($\tau = 1$)
achieves the supremum in the definition of $v_\gamma(x)$, and thus $v_\gamma(x) = 0$. The
Gittins index, or fair charge, of a state $x$ is the smallest value of $\gamma$ for which
the learner is indifferent between retiring immediately and playing for at least
one round:

\[ g(x) = \inf\{\gamma \in \mathbb{R} : v_\gamma(x) = 0\}. \tag{35.8} \]

Straightforward manipulation (Exercise 35.6) shows that

\[ g(x) = \sup_{\tau \ge 2} \frac{\mathbb{E}_x\left[\sum_{t=1}^{\tau-1} \alpha^{t-1} r(S_t)\right]}{\mathbb{E}_x\left[\sum_{t=1}^{\tau-1} \alpha^{t-1}\right]}. \tag{35.9} \]

The form in (35.9) will be useful for computation. It is not immediately clear that
a stopping time attaining the supremum in (35.9) exists. The following lemma
shows that it does and gives an explicit form.


LEMMA 35.7. Let $x \in S$ be arbitrary. The following hold under Assumption 35.6:


(a) $v_\gamma(x) = \max\{0,\; r(x) - \gamma + \alpha \int_S v_\gamma(y)\, P_x(dy)\}$ for all $\gamma \in \mathbb{R}$.

(b) If $\gamma \le g(x)$, then $v_\gamma(x) = r(x) - \gamma + \alpha \int_S v_\gamma(y)\, P_x(dy)$.

(c) The stopping time $\tau = \min\{t \ge 2 : g(S_t) \le \gamma\}$ attains the supremum in
Eq. (35.9).


The result is relatively intuitive. The Gittins index represents the price the 
learner should be willing to pay for the privilege of continuing to play. The optimal 
policy continues to play as long as the actual value of the game is not smaller 
than this price was at the start. The proof of Lemma 35.7 uses Theorem 35.3 
and is left for the reader in Exercise 35.7. 


Discounted Bandits and the Index Theorem 


The generalisation of the discounted retirement game to multiple arms is quite 
straightforward. As we will see, this will lead to a solution to the infinite horizon 
discounted Bayesian k-armed bandit problem where the prior factorises over the 
arms. 

There are now $k$ independent Markov chains sharing the same state space $S$.
We are also given a reward function $r : S \to \mathbb{R}$. In each round $t$, the learner first
observes the state of all chains $S_1(t), \ldots, S_k(t)$ and then chooses an action $A_t \in [k]$
to receive a reward $r(S_{A_t}(t))$ and to make the state of the chain underlying arm
$A_t$ move according to a fixed transition kernel that is common to all chains. The
states of the other chains do not move. The goal is still to maximise the total
expected discounted reward. The interaction protocol is illustrated in Fig. 35.3.


The assumption that the Markov chains evolve on the same state space with 
the same transition kernel is non-restrictive since the state space can always 
be taken to be the union of k state spaces and the transition kernel defined 
with k disconnected components. 


Because the learner observes the state of all chains in each round, a policy $\pi$
now is a collection $(\pi_t)_{t=1}^\infty$, where $\pi_t$ is a probability kernel from $(S^k \times [k])^{t-1} \times S^k$
(history, including past observed states and actions) to $[k]$. Given a discount
rate $\alpha \in (0, 1)$, the objective is to find the policy maximising the cumulative
discounted reward:

\[ \operatorname{argmax}_\pi\; \mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^t r(S_{A_t}(t))\right], \]

where the expectation is taken with respect to the distribution on state/action
sequences induced by the interaction of $\pi$ and the $k$ Markov chains.

Figure 35.3 Interaction protocol for discounted bandits with Markov pay-offs. Set
$t = 1$ and initialise $S_1(1), \ldots, S_k(1)$. Then repeat: observe states $S_1(t), \ldots, S_k(t)$;
choose action $A_t \in [k]$; receive reward $r(S_{A_t}(t))$; update $S_i(t + 1) = S_i(t)$ for
$i \ne A_t$ and $S_{A_t}(t + 1) \sim P_{S_{A_t}(t)}(\cdot)$; increment $t$.


EXAMPLE 35.8 (Bayesian k-armed Bernoulli bandits in the Markov framework).
To see the relation to Bayesian bandits with discounted rewards, consider the
following set-up. Let $S = [0, \infty) \times [0, \infty)$ and $\mathcal{G} = \mathfrak{B}(S)$. Then let the initial state
of each Markov chain be $S_i(1) = (1, 1)$, and define probability kernel $(P_s : s \in S)$
from $(S, \mathcal{G})$ to itself by

\[ P_{(x,y)}(A) = \frac{x}{x + y}\, \mathbb{I}_A((x + 1, y)) + \frac{y}{x + y}\, \mathbb{I}_A((x, y + 1)). \]

The reward function is $r(x, y) = x/(x + y)$. The reader should check that this
corresponds to a Bernoulli bandit with Beta(1,1) prior on the mean reward of
each arm (Exercise 35.8). The role of the state space is to maintain a sufficient
statistic for the posterior while the reward function is the expected reward given
the posterior.
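
For the Beta/Bernoulli chain of Example 35.8, the Gittins index can be approximated directly from the definition in Eq. (35.8) together with the recursion in Lemma 35.7(a). The sketch below is only one possible approach: it forces retirement after `depth` plays, which is an approximation we introduce (the discarded tail is of order $\alpha^{\text{depth}}$), and then bisects on $\gamma$. The function names and default parameters are hypothetical.

```python
import numpy as np

def retirement_value(gamma, a, b, alpha, depth):
    """Approximate v_gamma((a, b)), forcing retirement after `depth` plays."""
    v = np.zeros(depth + 1)                   # value 0 at the truncation level
    for d in range(depth - 1, -1, -1):        # d = number of plays so far
        i = np.arange(d + 1)                  # i = number of successes so far
        p = (a + i) / (a + b + d)             # r(x, y) = x / (x + y) in state (a+i, b+d-i)
        cont = p - gamma + alpha * (p * v[1:d + 2] + (1 - p) * v[:d + 1])
        v = np.maximum(0.0, cont)             # Lemma 35.7(a)
    return v[0]

def gittins_index(a, b, alpha, depth=200, tol=1e-6):
    """Bisection for g((a, b)) = inf{gamma : v_gamma((a, b)) = 0}."""
    lo, hi = 0.0, 1.0                         # rewards lie in [0, 1], so the index does too
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if retirement_value(mid, a, b, alpha, depth) > 0:
            lo = mid                          # still worth paying mid: index is larger
        else:
            hi = mid
    return (lo + hi) / 2

g11 = gittins_index(1, 1, alpha=0.9)          # index of the initial state (1, 1)
```

Because $\gamma \mapsto v_\gamma(x)$ is decreasing and non-negative, the bisection is well defined. The resulting index exceeds the posterior mean $x/(x + y)$, which reflects the value of the information gained by playing.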


Returning to the general problem, let $g$ be the Gittins index function as defined
in Eq. (35.8) associated with the probability kernel $(P_x : x \in S)$ and reward
function $r$. A policy $\pi^*$ that chooses in round $t$ the arm $A_t \in \operatorname{argmax}_{i \in [k]} g(S_i(t))$
is called a Gittins index policy. One of the most celebrated theorems in the
study of bandits is that these policies are Bayesian optimal.


THEOREM 35.9. Let $\pi^*$ be a policy choosing in round $t$ the arm $A_t = \operatorname{argmax}_i g(S_i(t))$
with ties broken arbitrarily. Then, provided Assumption 35.6 holds for all Markov
chains $(S_{iu})_{u=1}^\infty$,

\[ \mathbb{E}_{\pi^*}\left[\sum_{t=1}^\infty \alpha^t r(S_{A_t}(t))\right] = \sup_\pi \mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^t r(S_{A_t}(t))\right], \]

where the supremum is taken over all policies.


The remainder of the section is devoted to proving Theorem 35.9. The choice
of actions produces an interleaving of the rewards generated by each Markov
chain, and it will be useful to have a notation for these interleavings. For each
$i \in [k]$, let $g_i = (g_{it})_{t=1}^\infty$ be a real-valued sequence and $g = (g_1, \ldots, g_k)$ be the
tuple of these sequences.


While this notation breaks our convention of putting the time index first in 
the reward sequences of a multi-armed bandit, we prefer this notation here 
as we need to consider reward sequences underlying individual arms. 


Given an infinite sequence $(a_t)_{t=1}^\infty$ taking values in $[k]$, define the interleaving
sequence $I(g, a) = (I_t(g, a))_{t=1}^\infty$ by

\[ I_t(g, a) = g_{a_t,\, 1 + n_{a_t}(a, t-1)} \quad \text{with} \quad n_i(a, t - 1) = \sum_{s=1}^{t-1} \mathbb{I}\{a_s = i\}. \]

Note that this is the same as the ‘reward-stack model’ of bandits mentioned on
page 65 in Chapter 4 except that here we have fixed sequences. The next lemma
follows from the Hardy–Littlewood inequality, a generalisation of the trivial
observation that the identical ordering of two sequences of numbers maximises
their inner product. We leave the proof to Exercise 35.9.


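LEMMA 35.10. Suppose that $g_i$ is decreasing for all $i \in [k]$ and $(a_t^*)_{t=1}^\infty$ is defined 
recursively by $a_t^* = \operatorname{argmax}_i g_{i,\, 1 + n_i(a^*, t-1)}$ and $I^*(g) = I(g, a^*)$. Then, for any 
$\alpha \in (0, 1)$, 

$$\sum_{t=1}^\infty \alpha^{t-1} I_t^*(g) = \sup_{a \in [k]^{\mathbb{N}^+}} \sum_{t=1}^\infty \alpha^{t-1} I_t(g, a).$$
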
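
The interleaving notation and the greedy sequence of Lemma 35.10 can be explored numerically. The following is a minimal sketch, not part of the original text: the function names are hypothetical, indices are 0-based, and the infinite horizon is truncated to a finite one (for decreasing sequences the greedy choice remains optimal under this truncation, which the brute-force comparison illustrates).

```python
from itertools import product


def interleave(g, a):
    """Return (I_t(g, a))_t: the t-th entry is the next unused element of g[a_t]."""
    used = [0] * len(g)  # used[i] plays the role of n_i(a, t - 1)
    out = []
    for arm in a:
        out.append(g[arm][used[arm]])
        used[arm] += 1
    return out


def greedy_actions(g, horizon):
    """The sequence a* of Lemma 35.10: always take the largest next element."""
    used = [0] * len(g)
    actions = []
    for _ in range(horizon):
        arm = max(range(len(g)), key=lambda i: g[i][used[i]])
        actions.append(arm)
        used[arm] += 1
    return actions


def discounted_sum(xs, alpha):
    """Sum of alpha^(t-1) * x_t with t starting at 1."""
    return sum(alpha ** t * x for t, x in enumerate(xs))


def brute_force_best(g, horizon, alpha):
    """Maximum discounted sum over all action sequences of the given length."""
    return max(
        discounted_sum(interleave(g, a), alpha)
        for a in product(range(len(g)), repeat=horizon)
    )
```

For example, with `g = [[5, 3, 1, 0, 0, 0], [4, 2, 2, 1, 0, 0]]` the greedy sequence interleaves the two decreasing sequences into a single decreasing one, and its discounted sum agrees with `brute_force_best`.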
Proof of Theorem 35.9 Given a policy $\pi = (\pi_t)_{t=1}^\infty$, let $(\Omega, \mathcal{F}, \mathbb{P}_\pi)$ be a 
probability space carrying random elements $S_1, \ldots, S_k$, where $S_i = (S_{iu})_{u=1}^\infty$ 
is a sequence of states and $(A_t)_{t=1}^\infty$ is a sequence of actions such that 

(a) $\mathbb{P}_\pi(S_{i,u+1} \in \cdot \mid S_{i1}, S_{i2}, \ldots, S_{iu}) = P_{S_{iu}}(\cdot)$; 

(b) The sequences $(S_{iu})_{u=1}^\infty$ and $(S_{ju})_{u=1}^\infty$ are independent for all $i \neq j$; and 

(c) $\mathbb{P}_\pi(A_t \in \cdot \mid S(1), A_1, \ldots, A_{t-1}, S(t)) = \pi_t(\cdot \mid S(1), A_1, \ldots, A_{t-1}, S(t))$, where 
$S_i(t) = S_{i,\, 1 + T_i(t-1)}$ is the state of machine $i$ observed by the learner at the 
start of round $t$ with $T_i(t) = \sum_{s=1}^t \mathbb{I}\{A_s = i\}$. 


Let $\mathcal{F}_t = \sigma(S(1), A_1, S(2), \ldots, A_{t-1}, S(t), A_t)$ be the $\sigma$-algebra containing 
information available to the learner after choosing their action in round $t$. As 
usual, $\mathbb{E}_\pi$ denotes the expectation with respect to $\mathbb{P}_\pi$. 

Given an arm $i$ and round $t$, the prevailing charge is a random variable 
$G_i(t) = \min_{s \leq t} g(S_i(s))$. The name comes from one of the early proofs of Gittins's 
theorem that constructed a game in which the prevailing charge was the fee paid 
by the learner to play arm $i$ in round $t$. The proof is decomposed into two steps. 
In the first step, we relate the prevailing charge to the discounted cumulative 
reward. The second step completes the proof by combining the first with an 
interleaving argument using Lemma 35.10. 

Part 1: The Prevailing Charge 
Fix an arm $i$. We claim that 

$$\mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} r(S_i(t))\, \mathbb{I}\{A_t = i\}\right] \leq \mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} G_i(t)\, \mathbb{I}\{A_t = i\}\right].$$


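Furthermore, equality holds for $\pi = \pi^*$. To prove this claim, let $\tau_1, \tau_2, \ldots$ be a 
sequence of stopping times defined recursively by 

$$\tau_1 = \min\{t \geq 1 : A_t = i\} \quad\text{and}\quad \tau_{j+1} = \min\{t > \tau_j : A_t = i \text{ and } g(S_i(t)) < G_i(\tau_j)\},$$

where the minimum of the empty set is defined to be infinite. Next, let 

$$T_j = \{t : A_t = i \text{ and } \tau_j \leq t < \tau_{j+1}\} \quad\text{and}\quad \gamma_j = G_i(\tau_j).$$

Note that on the event $\{\tau_j < \infty\}$, $G_i(t) = \gamma_j$ for all $t \in T_j$. Furthermore, 
$g(S_i(\tau_j)) = \gamma_j$. By definition, we have 

$$\mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} \big(r(S_i(t)) - G_i(t)\big)\, \mathbb{I}\{A_t = i\}\right] = \sum_{j=1}^\infty \mathbb{E}_\pi\left[\sum_{t \in T_j} \alpha^{t-1} \big(r(S_i(t)) - \gamma_j\big)\right].$$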


The claim follows by showing the term inside the sum on the right-hand side 
vanishes for the Gittins index policy and is not positive for any other policy. 

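Fix $j \geq 1$. By definition, for $t \in T_j$ it holds that $g(S_i(t)) \geq G_i(t) = \gamma_j$. 
Combining this with Part (b) of Lemma 35.7, on $\{t \in T_j\}$, thanks to $\{t \in T_j\} \in \mathcal{F}_t$, 

$$v_{\gamma_j}(S_i(t)) + \gamma_j - r(S_i(t)) = \alpha \int v_{\gamma_j}(y)\, P_{S_i(t)}(dy) = \alpha\, \mathbb{E}_\pi\big[v_{\gamma_j}(S_i(t+1)) \mid \mathcal{F}_t\big].$$

From this it follows that 

$$\mathbb{E}_\pi\left[\sum_{t \in T_j} \alpha^{t-1} \big(r(S_i(t)) - \gamma_j\big)\right] = \mathbb{E}_\pi\left[\sum_{t \in T_j} \alpha^{t-1} \big(v_{\gamma_j}(S_i(t)) - \alpha v_{\gamma_j}(S_i(t+1))\big)\right] \leq 0,$$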


where the final inequality holds since $v_{\gamma_j}$ is non-negative, $v_{\gamma_j}(S_i(\tau_j)) = 0$ and by 
telescoping the sum, which is possible because whenever $t'$ is the smallest element 
larger than $t$ in $T_j$, then $S_i(t') = S_i(t + 1)$. We now argue that the inequality 
is replaced by an equality for the Gittins index policy. The key observation 
is that having played $A_{\tau_j} = i$, the Gittins index policy continues playing arm 
$i$ until $g(S_i(t)) < \gamma_j$, which means that $T_j = \{\tau_j, \tau_j + 1, \ldots, \kappa_j - 1\}$, where 
$\kappa_j = \min\{t > \tau_j : g(S_i(t)) < \gamma_j\}$, which by Part (c) of Lemma 35.7 means that 

$$\mathbb{E}_{\pi^*}\left[\sum_{t \in T_j} \alpha^{t-1} \big(r(S_i(t)) - \gamma_j\big) \,\Big|\, \mathcal{F}_{\tau_j}\right] = \alpha^{\tau_j - 1} v_{\gamma_j}(S_i(\tau_j)) = 0.$$

Part 2: Interleaving Prevailing Charges 

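Let $H_{iu} = \min_{v \leq u} g(S_{iv})$. The key point is that the distribution of $H = (H_{iu})$ 
does not depend on the choice of policy, and clearly $H_{iu}$ is decreasing in $u$ for 
each $i$. For the Gittins index policy $\pi^*$, 

$$\mathbb{E}_{\pi^*}\left[\sum_{t=1}^\infty \alpha^{t-1} r(S_{A_t}(t))\right] = \mathbb{E}_{\pi^*}\left[\sum_{t=1}^\infty \alpha^{t-1} G_{A_t}(t)\right] = \mathbb{E}_{\pi^*}\left[\sum_{t=1}^\infty \alpha^{t-1} I_t(H, A)\right] = \mathbb{E}_{\pi^*}\left[\sum_{t=1}^\infty \alpha^{t-1} I_t^*(H)\right],$$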


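where the first equality follows from part 1, the second by the definition of $I_t$ 
and $H$ and the third by the definitions of $I^*$ from Lemma 35.10 and that of 
the Gittins index policy, which always chooses an action that maximises the 
prevailing charge. On the other hand, for any policy $\pi$, 

$$\mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} r(S_{A_t}(t))\right] \leq \mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} G_{A_t}(t)\right] = \mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} I_t(H, A)\right] \leq \mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} I_t^*(H)\right],$$

where the last line follows from Lemma 35.10. Finally, note that the law of $H$ 
under $\mathbb{P}_\pi$ does not depend on $\pi$, and hence 

$$\mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} I_t^*(H)\right] = \mathbb{E}_{\pi^*}\left[\sum_{t=1}^\infty \alpha^{t-1} I_t^*(H)\right].$$

Therefore, for all $\pi$, 

$$\mathbb{E}_{\pi^*}\left[\sum_{t=1}^\infty \alpha^{t-1} r(S_{A_t}(t))\right] \geq \mathbb{E}_\pi\left[\sum_{t=1}^\infty \alpha^{t-1} r(S_{A_t}(t))\right],$$

which completes the proof. 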


Computing the Gittins Index 


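We describe a simple approach that depends on the state space being finite. 
References to more general methods are given in the bibliographic remarks. 
Assume without loss of generality that $\mathcal{S} = \{1, 2, \ldots, |\mathcal{S}|\}$ and $\mathcal{G} = 2^{\mathcal{S}}$. The matrix 
form of the transition kernel is $P \in [0,1]^{|\mathcal{S}| \times |\mathcal{S}|}$ and is defined by $P_{ij} = P_i(\{j\})$. 
We also let $r \in [0,1]^{|\mathcal{S}|}$ be the vector of rewards so that $r_i = r(i)$. The standard 
basis vector is $e_i \in \mathbb{R}^{|\mathcal{S}|}$, and $\mathbf{1} \in \mathbb{R}^{|\mathcal{S}|}$ is the vector with one in every coordinate. 
For $C \subseteq \mathcal{S}$, let $Q_C$ be the transition matrix with $(Q_C)_{ij} = P_{ij}\, \mathbb{I}_C(j)$. For each 
$i \in \mathcal{S}$, the goal is to find 

$$g(i) = \sup_{\tau \geq 2} \frac{\mathbb{E}_i\left[\sum_{t=1}^{\tau-1} \alpha^{t-1} r(S_t)\right]}{\mathbb{E}_i\left[\sum_{t=1}^{\tau-1} \alpha^{t-1}\right]},$$

where $\mathbb{E}_i$ is the expectation with respect to the measure $\mathbb{P}_i$ for which the initial 
state is $S_1 = i$. Lemma 35.7 shows that the stopping time $\tau = \min\{t \geq 2 : g(S_t) < g(i)\}$ 
attains the supremum in the above display. The set $C_i = \{j : g(j) \geq g(i)\}$ 
is called the continuation region, and $\mathcal{S}_i = \mathcal{S} \setminus C_i$ is the stopping region. Then 
the Gittins index can be calculated as 

$$g(i) = \frac{\mathbb{E}_i\left[\sum_{t=1}^{\tau-1} \alpha^{t-1} r(S_t)\right]}{\mathbb{E}_i\left[\sum_{t=1}^{\tau-1} \alpha^{t-1}\right]} = \frac{\sum_{t=1}^\infty \alpha^{t-1} e_i^\top Q_{C_i}^{t-1} r}{\sum_{t=1}^\infty \alpha^{t-1} e_i^\top Q_{C_i}^{t-1} \mathbf{1}} = \frac{e_i^\top (I - \alpha Q_{C_i})^{-1} r}{e_i^\top (I - \alpha Q_{C_i})^{-1} \mathbf{1}}.$$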

All this suggests an induction approach where the Gittins index is calculated 
for each state in decreasing order of their indices. To get started, note that the 
maximum possible Gittins index is $\max_i r_i$ and that this is achievable for state 
$i = \operatorname{argmax}_j r_j$ with the deterministic stopping time $\tau = 2$. For the induction 
step, assume that $g(i)$ is known for the $j$ states $C = \{i_1, i_2, \ldots, i_j\}$ with the 
largest Gittins indices. Then $i_{j+1}$ is given by 

$$i_{j+1} = \operatorname{argmax}_{i \notin C} \frac{e_i^\top (I - \alpha Q_C)^{-1} r}{e_i^\top (I - \alpha Q_C)^{-1} \mathbf{1}}.$$

If Gauss-Jordan elimination is used for matrix inversion, then the computational 
complexity of this algorithm is $O(|\mathcal{S}|^4)$. A more sophisticated inversion algorithm 
would reduce the complexity to $O(|\mathcal{S}|^{3+\varepsilon})$ for some $\varepsilon \leq 0.373$, but these are 
seldom practical. When $\alpha$ is relatively small, the inversion can be replaced by 
directly calculating the sums to some truncated horizon with little loss in accuracy. 
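
The induction above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function name `gittins_indices` is hypothetical, states are indexed from 0, the linear systems are solved directly instead of forming an inverse, and the base case is handled by starting with an empty set $C$ (for which the ratio reduces to $r_i$).

```python
import numpy as np


def gittins_indices(P, r, alpha):
    """Gittins indices of a finite-state chain with kernel P, rewards r, discount alpha."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    n = len(r)
    g = np.zeros(n)
    in_C = np.zeros(n, dtype=bool)  # states whose index is already known
    for _ in range(n):
        # Q_C keeps only transitions into C: (Q_C)_{ij} = P_{ij} * 1{j in C}.
        Q = P * in_C[np.newaxis, :]
        M = np.eye(n) - alpha * Q
        numerator = np.linalg.solve(M, r)             # (I - alpha Q_C)^{-1} r
        denominator = np.linalg.solve(M, np.ones(n))  # (I - alpha Q_C)^{-1} 1
        ratio = np.where(in_C, -np.inf, numerator / denominator)
        best = int(np.argmax(ratio))  # next state in decreasing order of index
        g[best] = ratio[best]
        in_C[best] = True
    return g
```

As a sanity check, for the two-state chain in which state 0 has reward 0 and moves deterministically to an absorbing state 1 with reward 1, the supremum in the definition of $g(0)$ is attained by never stopping, which gives $g(0) = \alpha$ and $g(1) = 1$.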


Notes 


1 Bayesian methods automatically and optimally exploit the assumptions encoded 
in their prior. If we think of the prior as a way of enriching and refining the 
standard formulation of bandits, this is an advantage. However, this blessing 
can also be a curse. A policy that exploits its assumptions too heavily can 
be brittle when those assumptions turn out to be wrong. This can have a 
devastating effect in bandits where the cost of overly aggressive confidence 
intervals is large. 

2 We claimed that computing the Bayesian optimal policy is generally intractable 
without discounting. This is a widely held belief, but we are not aware of any 
lower bound on the computational complexity. A good place to start might be 
to lower bound the computational complexity of finding the optimal action for 
k-armed Bayesian bandits when the prior is a product of Beta distributions, 
but without discounting. 

The solution to optimal stopping problems is essentially a form of dynamic 
programming, which is a method that trades memory for computation by 
introducing recursively defined value functions that suffice for reconstructing an 
optimal policy. In the one-armed bandit optimal stopping problem, thanks to 
the factorisation lemma (Lemma 2.5), for any $0 \leq t \leq n$, there exists a function 
$w_t : \mathbb{R}^t \to \mathbb{R}$ such that $W_t = w_t(X_1, \ldots, X_t)$ almost surely. This function can 
be seen as the value function that captures the optimal value-to-go from stage 
$t$ on, and (35.3) gives a recursive construction for it: $w_n(x_1, \ldots, x_n) = 0$, and 
for $t < n$, 

$$w_t(x_1, \ldots, x_t) = \max\left((n - t)\mu_2,\ \int \big(x_{t+1} + w_{t+1}(x_1, \ldots, x_t, x_{t+1})\big)\, dP_t(x_{t+1})\right),$$

where $P_t$ is the distribution of $X_{t+1}$ given $X_1, \ldots, X_t$. The problem with 
this general recursion is that the computation is prohibitive. The example 
with Bernoulli rewards shows that sometimes a similar recursion holds on a 
reduced ‘state space’ that avoids the combinatorial explosion that typically 
arises. For Gaussian rewards, even the reduced ‘state space’ was uncountably 
large, and a piecewise quadratic approximation was suggested. When this 
kind of approximation is used, we get an instance of approximate dynamic 
programming. 
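
To make the reduced state space concrete, here is a minimal sketch of the recursion for a Bernoulli arm. The assumptions are ours for illustration: the unknown arm has a Beta(1,1) prior, retiring pays the known mean `mu2` in every remaining round, and the state after `t` plays is summarised by the number of successes `s`. The function name is hypothetical.

```python
from functools import lru_cache


def one_armed_bernoulli_value(n, mu2):
    """Optimal Bayesian value of an n-round one-armed Bernoulli bandit (Beta(1,1) prior)."""

    @lru_cache(maxsize=None)
    def w(t, s):
        # Value-to-go after t plays of the unknown arm with s observed successes.
        if t == n:
            return 0.0
        p = (s + 1) / (t + 2)  # posterior mean of the unknown arm
        keep_playing = p * (1 + w(t + 1, s + 1)) + (1 - p) * w(t + 1, s)
        retire = (n - t) * mu2
        return max(retire, keep_playing)

    return w(0, 0)
```

The table has only $O(n^2)$ entries, in contrast to the recursion over full reward histories. Two easy checks: with `mu2 = 0` retiring is never useful and the value is $n/2$, while with `mu2 = 1` retiring immediately is optimal and the value is $n$.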

Discounted bandits with Markov pay-offs (Fig. 35.3) are a special case of 
discounted Markov decision processes on which there is a large literature. More 
details are in the bibliographic remarks in Chapter 38. 

Economists have long recognised the role of time in the utility people place on 
rewards. Most people view a promise of pizza (freshly made) a year from today 
as less valuable than the same pizza tomorrow. Discounting rewards is one way 
to model this kind of preference. The formal model is credited to renowned 
American economist Paul Samuelson [1937], who, according to Frederick et al. 
[2002], had serious reservations about both the normative and descriptive value 
of the model. While discounting is not very common in the frequentist bandit 
literature, it appears often in reinforcement learning, where it offers certain 
technical advantages [Sutton and Barto, 1998]. 

Theorem 35.9 only holds for geometric discounting. If $\alpha^{t-1}$ is replaced by $\alpha(t)$, 
where $\alpha(\cdot)$ is not an exponential, then one can construct Markov chains for 
which the optimal policy is not an index policy. The intuition behind this result 
is that when $\alpha(t)$ is not an exponential function, then the Gittins index of 
an arm can change even in rounds you play a different arm, and this breaks 
the interleaving argument [Berry and Fristedt, 1985, chapter 6]. The Gittins 
index theorem is brittle in other ways. For example, it no longer holds in the 
multiple-play setting, where the learner can choose multiple arms in each round 
[Pandelis and Teneketzis, 1999]. 

The previous note does not apply to one-armed bandits for which the 
interleaving argument is not required. Given a Markov chain $(S_t)_t$ and horizon 
$n$, the undiscounted Gittins index of state $s$ is 

$$g_n(s) = \sup_{2 \leq \tau \leq n} \frac{\mathbb{E}_s\left[\sum_{t=1}^{\tau-1} r(S_t)\right]}{\mathbb{E}_s[\tau - 1]}.$$

If the learner receives reward $\mu_2$ by retiring, then the Bayesian optimal policy 
is to retire in the first round $t$ when $g_{n-t+1}(S_t) < \mu_2$. A reasonable strategy 
for undiscounted k-armed bandits is to play the arm $A_t$ that maximises 
$g_{n-t+1}(S_i(t))$ over $i$. Although this strategy is not Bayesian optimal anymore, it 
nevertheless performs well in practice. In the Gaussian case, it even enjoys 
frequentist regret guarantees similar to UCB [Lattimore, 2016a]. 

The form of the undiscounted Gittins index was analysed asymptotically 
by Burnetas and Katehakis [1997b], who showed the index behaves like the 
upper confidence bound provided by KL-UCB. This should not be especially 
surprising and explains the performance of the algorithm in the previous note. 
The asymptotic nature of the result does not make it suitable for proving regret 
guarantees, however. 

We mentioned that computing the Bayesian optimal policy in finite horizon 
bandits is computationally intractable. But this is not quite true if n is small. 
For example, when n = 50 and k = 5, the dynamic program for computing 
the exact Bayesian optimal policy for Bernoulli noise and Beta prior has 
approximately $10^{11}$ states. A big number to be sure, but not so large that the 
table cannot be stored on disk. And this is without any serious effort to exploit 
symmetries. For mission-critical applications with small horizon, the benefits 
of exact optimality might make the computation worth the hassle. 
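
One way to arrive at a number of this order is sketched below. It rests on an assumption that is ours, not the text's: a state of the dynamic program is taken to be the vector of success and failure counts of every arm, with at most $n$ plays in total, so counting states amounts to counting non-negative integer vectors of length $2k$ with sum at most $n$. The function name is hypothetical.

```python
from math import comb


def num_beta_bernoulli_states(n, k):
    """Number of (successes, failures) count vectors for k arms with at most n plays."""
    # Stars and bars: vectors in N^(2k) with coordinate sum <= n.
    return comb(n + 2 * k, 2 * k)
```

Calling `num_beta_bernoulli_states(50, 5)` gives a value between $10^{10}$ and $10^{11}$, consistent with the order of magnitude quoted above.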

The algorithm in Section 35.5 for computing Gittins index is called Varaiya’s 
algorithm. In the bibliographic remarks, we give some pointers on where to 
look for more sophisticated methods. The assumption that $|\mathcal{S}|$ is finite is less 
severe than it may appear. When the discount rate is not too close to one, then 
for many problems the Gittins index can be approximated by removing states 
that are not reachable from the start state before the discounting means they 
become close to irrelevant. When the state space is infinite, there is often a 
topological structure that makes a discretisation possible. 


Bibliographical Remarks 


The classic text on optimal stopping is by Robbins et al. [1971], while a 
more modern text is by Peskir and Shiryaev [2006], which includes a proof 
of Theorem 35.2 (see theorem 1.2). With a little extra work, you can also extract 
the proof of Theorem 35.3 from section 1.2 of that book. We are not aware 
of a reference for Theorem 35.1, but Lai [1987] has shown that for sufficiently 
regular priors and noise models, the asymptotic Bayesian optimal regret is 
$\mathrm{BR}_n^* \sim c \log(n)^2$ for some constant $c > 0$ that depends on the prior/model 
(see theorem 3 of Lai [1987]). The Bayesian approach dominated research on 
bandits from 1960 to 1980, with Gittins’s result (Theorem 35.9) receiving the 
most attention [Gittins, 1979]. Gittins et al. [2011] have written a whole book on 
Bayesian bandits. Another book that focusses mostly on the Bayesian problem is 
by Berry and Fristedt [1985]. Although it is now more than 30 years old, this book 
is still a worthwhile read and presents many curious and unintuitive results about 
exact Bayesian policies. The book by Presman and Sonin [1990] also considers 
the Bayesian case. As compared to the other books, here the emphasis is on a 
case that is more similar to partial monitoring, the subject of Chapter 37 (in the 
adversarial setting). As far as we know, the earliest fully Bayesian analysis is by 
Bradt et al. [1956], who studied the finite horizon Bayesian one-armed bandit 
problem, essentially writing down the optimal policy using backwards induction, 
as presented here in Section 35.3. More general ‘approximation results’ are shown 
by Burnetas and Katehakis [2003], who show that under weak assumptions the 
Bayesian optimal strategy for one-armed bandits is asymptotically approximated 
by a retirement policy reminiscent of Eq. (35.4). The very specific approach 
to approximating the Bayesian strategy for Gaussian one-armed bandits is by 
one of the authors [Lattimore, 2016a], where a precise approximation for this 
special case is also given. There are at least four proofs of Gittins’s theorem 
[Gittins, 1979, Whittle, 1980, Weber, 1992, Tsitsiklis, 1994]. All are summarised 
in the review by Frostig and Weiss [1999]. There is a line of work on computing 
and/or approximating the Gittins index, which we cannot do justice to. The 
approach presented here for finite state spaces is due to Varaiya et al. [1985], but 
more sophisticated algorithms exist with better guarantees. A nice survey is by 
Chakravorty and Mahajan [2014], but see also the articles by Chen and Katehakis 
[1986], Kallenberg [1986], Sonin [2008], Niño-Mora [2011] and Chakravorty and 
Mahajan [2013]. There is also a line of work on approximations of the Gittins 
index, most of which are based on approximating the discrete time stopping 
problem with continuous time and applying free boundary methods [Yao, 2006, 
and references therein]. The Gittins index has been generalised to continuous 
time, where the challenge is to ensure the existence of solutions to the resulting 
stochastic differential equations [Karoui and Karatzas, 1994]. We mentioned 
restless bandits in Chapter 31 on non-stationary bandits, but they are usually 
studied in the Bayesian context [Whittle, 1988, Weber and Weiss, 1990]. The 
difference is that now the Markov chains for all actions evolve regardless of the 
action chosen, but the learner only gets to observe the new state for the action 
they chose. 


Exercises 


35.1 (BOUNDING THE BAYESIAN OPTIMAL REGRET) Prove Theorem 35.1. 


HINT For the first part, you should use the existence of a policy for Bernoulli 
bandits such that 

$$R_n(\pi, \nu) \leq C \min\left\{\sqrt{kn},\ \frac{k \log(n)}{\Delta_{\min}(\nu)}\right\},$$

where $C > 0$ is a universal constant and $\Delta_{\min}(\nu)$ is the smallest positive 
suboptimality gap. Then let $\mathcal{E}_n$ be a set of bandits for which there exists a 
small enough positive suboptimality gap and integrate the above bound on $\mathcal{E}_n$ 
and $\mathcal{E}_n^c$. The second part is left as a challenge, though the solution is available. 


35.2 (FINITE HORIZON OPTIMAL STOPPING) Prove Theorem 35.2. 


HINT Prove that $(E_t)_{t=0}^n$ is an $\mathbb{F}$-adapted supermartingale and that for a stopping 
time $\tau$ satisfying the conditions of the theorem, $(M_t)_{t=0}^n$ defined by $M_t = E_{t \wedge \tau}$ 
is a martingale. Then apply the optional stopping theorem (Theorem 3.8). 


35.3 (INFINITE HORIZON OPTIMAL STOPPING) Prove Theorem 35.3. 


HINT This is a technical exercise. Use theorem 1.7 of Peskir and Shiryaev [2006], 
and pass to the limit using the almost-sure convergence of $(U_t)_t$ as $t \to \infty$. You 
may find the ideas in the proof of theorem 1.11 of the same book useful. Be 
careful: Peskir and Shiryaev adopt the convention that stopping times are almost 
surely finite, while here we permit infinite stopping times. 


35.4 This exercise uses the notation and setting of Section 35.3. Suppose that 
$(S_t)_{t=0}^n$ is a sequence of random elements taking values in measurable space $(\mathcal{S}, \mathcal{H})$ 
and with $S_t$ being $\mathcal{F}_t/\mathcal{H}$-measurable and $P_{\nu 1}(Z_1, \ldots, Z_t \in \cdot \mid S_t)$ is independent 
of $\nu$. Show that $E_t$ is $\sigma(S_t)$-measurable, and there exists an $\mathcal{H}/\mathfrak{B}(\mathbb{R})$-measurable 
function $v_t : \mathcal{S} \to \mathbb{R}$ such that $E_t = v_t(S_t)$. You may assume that $(\mathcal{E}, \mathcal{G})$ is Borel. 


35.5 Prove Lemma 35.5. 


35.6 (EQUIVALENCE OF GITTINS DEFINITIONS) Prove that the definitions of 
the Gittins index given in Eq. (35.8) and Eq. (35.9) are equivalent. 


35.7 Prove Lemma 35.7. 


HINT Find a way to apply Theorem 35.3. 


35.8 Consider the discounted bandit with Markov pay-offs described in 
Example 35.8. Show that there is a one-to-one correspondence $\phi$ between the 
policies for this problem and the discounted Bayesian bandit with Beta(1,1) prior on 
the mean reward of each arm such that the total expected discounted reward 
(value) is invariant under $\phi$. 


35.9 Prove Lemma 35.10. 


HINT Use the Hardy–Littlewood inequality, which for infinite sequences states 
that for any real, increasing sequences $(x_n)_{n=1}^\infty$, $(y_n)_{n=1}^\infty$ and any bijection 
$\sigma : \mathbb{N}^+ \to \mathbb{N}^+$ it holds that $\sum_{n=1}^\infty x_n y_n \geq \sum_{n=1}^\infty x_n y_{\sigma(n)}$. 


35.10 (CORRECTNESS OF VARAIYA’S ALGORITHM) Prove the correctness of 
Varaiya’s algorithm, as explained in Section 35.5. 


35.11 In this exercise, you will implement some Bayesian (near-)optimal 1-armed 
bandit algorithms. 


(a) Reproduce the experimental results in Experiment 1. 

(b) Implement an approximation of the optimal policy for one-armed Gaussian 
bandits and compare its performance to the stopping rule Ta defined below 
for a variety of different choices of a > 0. 


2 l t 
Ta = min f> 2: ae y = ee )} < m} l 


Thompson Sampling 


“As all things come to an end, even this story, a day came at last when they were in 
sight of the country where Bilbo had been born and bred, where the shapes of the land 
and of the trees were as well known to him as his hands and toes.” — Tolkien [1937]. 


Like Bilbo, as the end nears, we return to where it all began, to the first algorithm 
for bandits proposed by Thompson [1933]. The idea is a simple one. Before the 
game starts, the learner chooses a prior over a set of possible bandit environments. 
In each round, the learner samples an environment from the posterior and 
acts according to the optimal action in that environment. Thompson only gave 
empirical evidence (calculated by hand) and focused on Bernoulli bandits with 
two arms. Nowadays these limitations have been eliminated, and theoretical 
guarantees have been proven demonstrating the approach is often close to optimal 
in a wide range of settings. Perhaps more importantly, the resulting algorithms 
are often quite practical both in terms of computation and empirical performance. 
The idea of sampling from the posterior and playing the optimal action is called 
Thompson sampling, or posterior sampling. 

The exploration in Thompson sampling comes from the randomisation. If the 
posterior is poorly concentrated, then the fluctuations in the samples are expected 
to be large and the policy will likely explore. On the other hand, as more data 
is collected, the posterior concentrates towards the true environment and the 
rate of exploration decreases. We focus our attention on finite-armed stochastic 
bandits and linear stochastic bandits, but Thompson sampling has been extended 
to all kinds of models, as explained in the bibliographic remarks. 


Randomisation is crucial for adversarial bandit algorithms and can be useful 
in stochastic settings (see Chapters 23 and 32 for examples). We should 
be wary, however, that injecting noise into our algorithms might come at a 
cost in terms of variance. What is gained or lost by the randomisation in 
Thompson sampling is still not clear, but we leave this cautionary note as a 
suggestion to the reader to think about some of the costs and benefits. 


Finite-Armed Bandits 


Recalling the notation from Section 34.5, let $k > 1$ and $(\mathcal{E}, \mathfrak{B}(\mathcal{E}), Q, P)$ be a 
k-armed Bayesian bandit environment. The learner chooses actions $(A_t)_{t=1}^n$ and 
receives rewards $(X_t)_{t=1}^n$, and the posterior after $t$ observations is a probability 
kernel $Q(\cdot \mid \cdot)$ from $([k] \times \mathbb{R})^t$ to $(\mathcal{E}, \mathfrak{B}(\mathcal{E}))$. Denote the mean of the $i$th arm in 
bandit $\nu \in \mathcal{E}$ by $\mu_i(\nu) = \int_{\mathbb{R}} x\, dP_{\nu i}(x)$. In round $t$, Thompson sampling samples 
a bandit environment $\nu_t$ from the posterior of $Q$ given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}$ 
and then chooses the arm with the largest mean (Algorithm 23). A more precise 
definition is that Thompson sampling is the policy $\pi = (\pi_t)_{t=1}^\infty$ with 

$$\pi_t(a \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1}) = Q(B_a \mid a_1, x_1, \ldots, a_{t-1}, x_{t-1}),$$

where $B_a = \{\nu \in \mathcal{E} : a = \operatorname{argmax}_b \mu_b(\nu)\} \in \mathfrak{B}(\mathcal{E})$, with ties in the argmax 
resolved in an arbitrary, but systematic fashion. 


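1: Input Bayesian bandit environment $(\mathcal{E}, \mathfrak{B}(\mathcal{E}), Q, P)$ 
2: for $t = 1, 2, \ldots, n$ do 
3:   Sample $\nu_t \sim Q(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$ 
4:   Choose $A_t = \operatorname{argmax}_{i \in [k]} \mu_i(\nu_t)$ 
5: end for 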


Algorithm 23: Thompson sampling. 
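
As an illustration of Algorithm 23, the sketch below instantiates it for Bernoulli arms with independent Beta(1,1) priors, for which sampling an environment from the posterior reduces to one Beta draw per arm. The choice of prior, the simulated environment `means` and the function name are assumptions made for this example only.

```python
import numpy as np


def thompson_sampling_bernoulli(means, n, seed=0):
    """Run Thompson sampling on a simulated Bernoulli bandit; return play counts."""
    rng = np.random.default_rng(seed)
    k = len(means)
    successes = np.ones(k)  # Beta(1,1) prior: one pseudo-success per arm
    failures = np.ones(k)   # and one pseudo-failure per arm
    counts = np.zeros(k, dtype=int)
    for _ in range(n):
        # Sampling nu_t from the posterior amounts to sampling each arm's mean.
        sampled_means = rng.beta(successes, failures)
        arm = int(np.argmax(sampled_means))
        reward = float(rng.random() < means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        counts[arm] += 1
    return counts
```

On an instance such as `means = [0.2, 0.8]`, the returned counts show how play concentrates on the better arm as the posterior concentrates.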


Thompson sampling has been analysed in both the frequentist and the Bayesian 
settings. We start with the latter where the result requires almost no assumptions 
on the prior. In fact, after one small observation about Thompson sampling, the 
analysis is almost the same as that of UCB. 


THEOREM 36.1. Let $(\mathcal{E}, \mathfrak{B}(\mathcal{E}), Q, P)$ be a k-armed Bayesian bandit environment 
such that for all $\nu \in \mathcal{E}$ and $i \in [k]$, the distribution $P_{\nu i}$ is 1-subgaussian (after 
centering) with mean in $[0, 1]$. Then the policy $\pi$ of Thompson sampling satisfies 

$$\mathrm{BR}_n(\pi, Q) \leq C\sqrt{kn \log(n)},$$

where $C > 0$ is a universal constant. 


Proof Abbreviate $\mu_i = \mu_i(\nu)$ and let $A^* = \operatorname{argmax}_{i \in [k]} \mu_i$ be the optimal arm, 
which depends on $\nu$ and is a random variable. When there are ties, we use the 
same tie-breaking rule as in the algorithm in the definition of $A^*$. For each $t \in [n]$ 
and $i \in [k]$, let 

$$U_t(i) = \operatorname{clip}_{[0,1]}\left(\hat\mu_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{1 \vee T_i(t-1)}}\right),$$

where $\hat\mu_i(t-1)$ is the empirical estimate of the reward of arm $i$ after $t-1$ rounds 
and we assume $\hat\mu_i(t-1) = 0$ if $T_i(t-1) = 0$. Let $E$ be the event that for all 
$t \in [n]$ and $i \in [k]$, 

$$|\hat\mu_i(t-1) - \mu_i| < \sqrt{\frac{2\log(1/\delta)}{1 \vee T_i(t-1)}}.$$

In Exercise 36.2, we ask you to prove that $\mathbb{P}(E^c) \leq 2nk\delta$. Let $\mathcal{F}_t = 
\sigma(A_1, X_1, \ldots, A_t, X_t)$ be the $\sigma$-algebra generated by the interaction sequence 
by the end of round $t$. Note that $U_t(i)$ is $\mathcal{F}_{t-1}$-measurable. The Bayesian regret is 

$$\mathrm{BR}_n = \mathbb{E}\left[\sum_{t=1}^n (\mu_{A^*} - \mu_{A_t})\right] = \mathbb{E}\left[\sum_{t=1}^n \mathbb{E}\big[\mu_{A^*} - \mu_{A_t} \mid \mathcal{F}_{t-1}\big]\right].$$


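The key insight (Exercise 36.3) is to notice that the definition of Thompson 
sampling implies the conditional distributions of $A^*$ and $A_t$ given $\mathcal{F}_{t-1}$ are the 
same: 

$$\mathbb{P}(A^* = \cdot \mid \mathcal{F}_{t-1}) = \mathbb{P}(A_t = \cdot \mid \mathcal{F}_{t-1}) \quad\text{a.s.} \qquad (36.1)$$

Using the previous display, 

$$\begin{aligned}
\mathbb{E}\big[\mu_{A^*} - \mu_{A_t} \mid \mathcal{F}_{t-1}\big] &= \mathbb{E}\big[\mu_{A^*} - U_t(A_t) + U_t(A_t) - \mu_{A_t} \mid \mathcal{F}_{t-1}\big] \\
&= \mathbb{E}\big[\mu_{A^*} - U_t(A^*) + U_t(A_t) - \mu_{A_t} \mid \mathcal{F}_{t-1}\big] \qquad \text{(Eq. (36.1))} \\
&= \mathbb{E}\big[\mu_{A^*} - U_t(A^*) \mid \mathcal{F}_{t-1}\big] + \mathbb{E}\big[U_t(A_t) - \mu_{A_t} \mid \mathcal{F}_{t-1}\big].
\end{aligned}$$

Using the tower rule for expectation shows that 

$$\mathrm{BR}_n = \mathbb{E}\left[\sum_{t=1}^n \big(\mu_{A^*} - U_t(A^*)\big) + \sum_{t=1}^n \big(U_t(A_t) - \mu_{A_t}\big)\right]. \qquad (36.2)$$
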
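On the event $E^c$ the terms inside the expectation are bounded by $2n$, while on 
the event $E$, the first sum is negative and the second is bounded by 

$$\begin{aligned}
\mathbb{I}\{E\} \sum_{t=1}^n \big(U_t(A_t) - \mu_{A_t}\big) &= \mathbb{I}\{E\} \sum_{t=1}^n \sum_{i=1}^k \mathbb{I}\{A_t = i\} \big(U_t(i) - \mu_i\big) \\
&\leq \sum_{i=1}^k \sum_{t=1}^n \mathbb{I}\{A_t = i\} \sqrt{\frac{8\log(1/\delta)}{1 \vee T_i(t-1)}} \leq \sum_{i=1}^k \int_0^{T_i(n)} \sqrt{\frac{8\log(1/\delta)}{s}}\, ds \\
&= \sum_{i=1}^k \sqrt{32\, T_i(n) \log(1/\delta)} \leq \sqrt{32\, nk \log(1/\delta)}.
\end{aligned}$$

The proof is completed by choosing $\delta = n^{-2}$ and the fact that $\mathbb{P}(E^c) \leq 2nk\delta$. 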


Frequentist Analysis 


Bounding the frequentist regret of Thompson sampling is more technical than the 
Bayesian regret. The trouble is the frequentist regret does not have an expectation 
with respect to the prior, which means that $A_t$ is not conditionally distributed in 
the same way as the optimal action (which is not random). Thompson sampling 
can be viewed as an instantiation of follow-the-perturbed-leader, which we already 
saw in action for adversarial combinatorial semi-bandits in Chapter 30. Here we 
work with the stochastic setting and consider the general form of the algorithm given 
in Algorithm 24. 


1: Input Cumulative distribution functions $F_1(1), \ldots, F_k(1)$ 
2: for $t = 1, \ldots, n$ do 
3:   Sample $\theta_i(t) \sim F_i(t)$ independently for each $i$ 
4:   Choose $A_t = \operatorname{argmax}_{i \in [k]} \theta_i(t)$ 
5:   Observe $X_t$ and update: 
     $F_i(t+1) = F_i(t)$ for $i \neq A_t$ and $F_{A_t}(t+1) = \mathrm{UPDATE}(F_{A_t}(t), A_t, X_t)$ 
6: end for 


Algorithm 24: Follow-the-perturbed-leader 


Thompson sampling is recovered by choosing $F_1(1), \ldots, F_k(1)$ to be the 
cumulative distribution functions of the mean reward of each arm for a prior 
that is independent over the arms (a product prior), and letting UPDATE be 
the function that updates the posterior for the played arm. There are, however, 
many alternative ways to configure this algorithm. 


The core property that we use in the analysis of Algorithm 24 (to be 
presented soon) is that $F_i(t+1) = F_i(t)$ whenever $A_t \neq i$. When UPDATE(·) 
is a Bayesian update this corresponds to choosing an independent prior on 
the distribution of each arm. 


Let $F_{is}$ be the cumulative distribution function used for arm $i$ in all rounds 
$t$ with $T_i(t-1) = s$. This quantity is defined even if $T_i(n) < s$ by using the 
reward-stack model from Section 4.6. 


THEOREM 36.2. Assume that arm 1 is optimal. Let $i > 1$ be an action and $\varepsilon \in \mathbb{R}$ 
be arbitrary. Then the expected number of times Algorithm 24 plays action $i$ is 
bounded by 

$$\mathbb{E}[T_i(n)] \leq 1 + \mathbb{E}\left[\sum_{s=0}^{n-1} \left(\frac{1}{G_{1s}} - 1\right)\right] + \mathbb{E}\left[\sum_{s=0}^{n-1} \mathbb{I}\{G_{is} > 1/n\}\right], \qquad (36.3)$$

where $G_{is} = 1 - F_{is}(\mu_1 - \varepsilon)$. 

In applications, $\varepsilon$ is normally chosen to be a small positive constant. In this case, 
the first sum in Eq. (36.3) measures the probability that the sample corresponding 
to the first arm is nearly optimistic and tends to be smaller when the variance 
of the perturbation is larger. The second sum measures the likelihood that 
the sample from arm $i$ is close to $\mu_1$ and is small when the variance of the 
perturbation is small. Balancing these two terms corresponds to optimising the 
exploration/exploitation trade-off. 


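Proof of Theorem 36.2 Let $\mathcal{F}_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$ and $E_i(t) = \{\theta_i(t) \leq 
\mu_1 - \varepsilon\}$. By definition, 

$$\mathbb{P}(\theta_1(t) > \mu_1 - \varepsilon \mid \mathcal{F}_{t-1}) = G_{1T_1(t-1)} \quad\text{a.s.}$$

We start with a straightforward decomposition: 

$$\mathbb{E}[T_i(n)] = \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = i\}\right] = \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = i, E_i(t)\}\right] + \mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = i, E_i^c(t)\}\right]. \qquad (36.4)$$

In order to bound the first term, let $A_t' = \operatorname{argmax}_{j \neq 1} \theta_j(t)$. Then 
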


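$$\begin{aligned}
\mathbb{P}(A_t = 1, E_i(t) \mid \mathcal{F}_{t-1}) &\geq \mathbb{P}(A_t' = i, E_i(t), \theta_1(t) > \mu_1 - \varepsilon \mid \mathcal{F}_{t-1}) \\
&= \mathbb{P}(\theta_1(t) > \mu_1 - \varepsilon \mid \mathcal{F}_{t-1})\, \mathbb{P}(A_t' = i, E_i(t) \mid \mathcal{F}_{t-1}) \\
&\geq \frac{G_{1T_1(t-1)}}{1 - G_{1T_1(t-1)}}\, \mathbb{P}(A_t = i, E_i(t) \mid \mathcal{F}_{t-1}), \qquad (36.5)
\end{aligned}$$

where in the first equality we used the fact that $\theta_1(t)$ is conditionally independent 
of $A_t'$ and $E_i(t)$ given $\mathcal{F}_{t-1}$. In the second inequality, we used the definition of 
$G_{1s}$ and the fact that 

$$\mathbb{P}(A_t = i, E_i(t) \mid \mathcal{F}_{t-1}) \leq \big(1 - \mathbb{P}(\theta_1(t) > \mu_1 - \varepsilon \mid \mathcal{F}_{t-1})\big)\, \mathbb{P}(A_t' = i, E_i(t) \mid \mathcal{F}_{t-1}),$$

which is true since $\{A_t = i, E_i(t) \text{ occurs}\} \subseteq \{A_t' = i, E_i(t) \text{ occurs}\} \cap \{\theta_1(t) \leq 
\mu_1 - \varepsilon\}$, and the two intersected events are conditionally independent given $\mathcal{F}_{t-1}$. 
Therefore using Eq. (36.5), we have 

$$\mathbb{P}(A_t = i, E_i(t) \mid \mathcal{F}_{t-1}) \leq \left(\frac{1}{G_{1T_1(t-1)}} - 1\right) \mathbb{P}(A_t = 1, E_i(t) \mid \mathcal{F}_{t-1}) \leq \left(\frac{1}{G_{1T_1(t-1)}} - 1\right) \mathbb{P}(A_t = 1 \mid \mathcal{F}_{t-1}).$$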

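Substituting this into the first term in Eq. (36.4) leads to 

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = i, E_i(t)\}\right] &\leq \mathbb{E}\left[\sum_{t=1}^n \left(\frac{1}{G_{1T_1(t-1)}} - 1\right) \mathbb{P}(A_t = 1 \mid \mathcal{F}_{t-1})\right] \\
&= \mathbb{E}\left[\sum_{t=1}^n \left(\frac{1}{G_{1T_1(t-1)}} - 1\right) \mathbb{I}\{A_t = 1\}\right] \\
&\leq \mathbb{E}\left[\sum_{s=0}^{n-1} \left(\frac{1}{G_{1s}} - 1\right)\right], \qquad (36.6)
\end{aligned}$$

where in the last step we used the fact that $T_1(t-1) = s$ is only possible for one 
round where $A_t = 1$. Let $\mathcal{T} = \{t \in [n] : 1 - F_{iT_i(t-1)}(\mu_1 - \varepsilon) > 1/n\}$. After some 
calculation (Exercise 36.5), we get 

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^n \mathbb{I}\{A_t = i, E_i^c(t)\}\right] &\leq \mathbb{E}\left[\sum_{t \in \mathcal{T}} \mathbb{I}\{A_t = i\}\right] + \mathbb{E}\left[\sum_{t \notin \mathcal{T}} \mathbb{I}\{E_i^c(t)\}\right] \\
&\leq \mathbb{E}\left[\sum_{s=0}^{n-1} \mathbb{I}\{1 - F_{is}(\mu_1 - \varepsilon) > 1/n\}\right] + \mathbb{E}\left[\sum_{t \notin \mathcal{T}} \frac{1}{n}\right] \\
&\leq \mathbb{E}\left[\sum_{s=0}^{n-1} \mathbb{I}\{G_{is} > 1/n\}\right] + 1.
\end{aligned}$$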


Putting together the pieces completes the proof. 


By instantiating Algorithm 24 with different choices of perturbations, one 
can prove that Thompson sampling enjoys frequentist guarantees in a number 
of settings. The following theorem shows that Thompson sampling with an 
appropriate prior is asymptotically optimal for the set of Gaussian bandits. 
The reader is invited to prove this result by following the steps suggested in 
Exercise 36.6. 


THEOREM 36.3. Suppose that $F_i(1) = \delta_\infty$ is the Dirac at infinity and let 
$\mathrm{UPDATE}(F_i(t), A_t, X_t)$ be the cumulative distribution function of the Gaussian 
$\mathcal{N}(\hat\mu_i(t), 1/T_i(t))$. Then the regret of Algorithm 24 on Gaussian bandit $\nu \in \mathcal{E}_{\mathcal{N}}^k(1)$ 
satisfies 

$$\lim_{n \to \infty} \frac{R_n}{\log(n)} = \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}.$$

Furthermore, there exists a universal constant $C > 0$ such that $R_n \leq C\sqrt{nk \log(n)}$. 


The choice of update and initial distributions in Theorem 36.3 corresponds to 
Thompson sampling when the prior mean and variance are sent to infinity at 
appropriate rates. For this choice, a finite-time analysis is also possible (see the 
exercise). 
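
The configuration of Algorithm 24 described in Theorem 36.3 can be simulated as follows. This is a hedged sketch: the function name and the simulated unit-variance Gaussian environment are assumptions for illustration, the Dirac at infinity is implemented by giving unplayed arms the sample $+\infty$, and the perturbation for a played arm is drawn from $\mathcal{N}(\hat\mu_i, 1/T_i)$.

```python
import numpy as np


def gaussian_thompson_sampling(means, n, seed=0):
    """Follow-the-perturbed-leader with Gaussian perturbations; returns play counts."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = len(means)
    counts = np.zeros(k, dtype=int)  # T_i(t - 1)
    sums = np.zeros(k)               # cumulative reward of each arm
    for _ in range(n):
        theta = np.full(k, np.inf)   # Dirac at infinity for arms not yet played
        played = counts > 0
        theta[played] = rng.normal(
            sums[played] / counts[played],     # empirical mean
            1.0 / np.sqrt(counts[played]),     # standard deviation 1 / sqrt(T_i)
        )
        arm = int(np.argmax(theta))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts


def pseudo_regret(means, counts):
    """Sum over arms of gap times number of plays."""
    gaps = np.max(means) - np.asarray(means, dtype=float)
    return float(gaps @ counts)
```

Only the distribution of the played arm changes in each round, which is the core property used in the analysis of Algorithm 24.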


EXPERIMENT 36.1 Empirically the algorithm described in Theorem 36.3 has a 
smaller expected regret than the version of UCB analysed in Chapter 7. Compared 
to more sophisticated algorithms, however, it has larger regret and larger variance. 
AdaUCB (which we briefly met in Section 9.3) and Thompson sampling were 
simulated on a two-armed Gaussian bandit with mean vector $\mu = (1/5, 0)$ and 
unit variance and a horizon of $n = 2000$. The expected regret as estimated 
over 100,000 independent runs was 23.8 for AdaUCB and 29.9 for Thompson 
sampling. The figure below shows the contribution of the second moment of 
$\hat R_n = \sum_i \Delta_i T_i(n)$ for each algorithm, which shows that Thompson sampling has 
a much larger variance than AdaUCB, in addition to its inferior expected regret. 


Linear Bandits 


While the advantages of Thompson sampling in finite-armed bandits are relatively 
limited, in the linear setting there is much to be gained, both in terms of 
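computation and empirical performance. Let $\mathcal{A} \subset \mathbb{R}^d$ and $(\mathcal{E}, \mathfrak{B}(\mathcal{E}), Q, P)$ be a 
Bayesian bandit environment where $\mathcal{E} \subset \mathbb{R}^d$ and for $\theta \in \mathcal{E}$ and $a \in \mathcal{A}$, $P_{\theta a}$ is 
1-subgaussian with mean $\langle \theta, a \rangle$. Let $\theta : \mathcal{E} \to \mathbb{R}^d$ be the identity map, which is a 
random vector on $(\mathcal{E}, \mathfrak{B}(\mathcal{E}), Q)$. 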


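1: Input Bayesian bandit environment $(\mathcal{E}, \mathfrak{B}(\mathcal{E}), Q, P)$ 
2: for $t = 1, \ldots, n$ do 
3:   Sample $\theta_t$ from the posterior 
4:   Choose $A_t = \operatorname{argmax}_{a \in \mathcal{A}} \langle a, \theta_t \rangle$ 
5:   Observe $X_t$ 
6: end for 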


Algorithm 25: Thompson sampling for linear bandits. 
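
A concrete instance of Algorithm 25 is sketched below for a finite action set. The modelling choices are assumptions made for the example and are not taken from the text: a Gaussian prior $\mathcal{N}(0, \lambda^{-1} I)$ on $\theta$ with `prior_precision` playing the role of $\lambda$, standard Gaussian noise, and a simulated parameter `theta_star`. With these choices the posterior is Gaussian and can be sampled in closed form. Note that a Gaussian prior does not satisfy the boundedness assumption of the theorem that follows.

```python
import numpy as np


def linear_thompson_sampling(actions, theta_star, n, prior_precision=1.0, seed=0):
    """Thompson sampling for a linear bandit with a Gaussian prior; returns play counts."""
    rng = np.random.default_rng(seed)
    actions = np.asarray(actions, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    d = actions.shape[1]
    V = prior_precision * np.eye(d)  # posterior precision matrix
    b = np.zeros(d)                  # sum of X_s * A_s
    counts = np.zeros(len(actions), dtype=int)
    for _ in range(n):
        cov = np.linalg.inv(V)
        cov = (cov + cov.T) / 2.0    # guard against round-off asymmetry
        theta_t = rng.multivariate_normal(cov @ b, cov)  # posterior sample
        idx = int(np.argmax(actions @ theta_t))
        a = actions[idx]
        x = a @ theta_star + rng.standard_normal()       # observed reward
        V += np.outer(a, a)
        b += x * a
        counts[idx] += 1
    return counts
```

The only optimisation needed in each round is a linear maximisation over the action set for the sampled parameter, which is one source of the computational advantage mentioned above.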


The Bayesian regret is controlled using the techniques from the previous section 
in combination with the concentration analysis in Chapter 20. A frequentist 
analysis is also possible under slightly unsatisfying assumptions, which we discuss 
in the notes and bibliographic remarks. 


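THEOREM 36.4. Assume that $\|\theta\|_2 \le S$ with $Q$-probability one and $\sup_{a\in\mathcal A} \|a\|_2 \le L$
and $\sup_{a\in\mathcal A} |\langle a, \theta\rangle| \le 1$ with $Q$-probability one. Then the Bayesian regret of
Algorithm 25 is bounded by

$$\mathrm{BR}_n \le 2 + 2\sqrt{2dn\beta^2 \log\left(1 + \frac{nS^2L^2}{d}\right)},$$

where $\beta = 1 + \sqrt{2\log(n) + d\log\left(1 + \frac{nS^2L^2}{d}\right)}$.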


For fixed $S$ and $L$, the upper bound obtained here is of order
$O(d\sqrt{n\log(n)\log(n/d)})$, which matches the upper bound obtained for LinUCB
in Corollary 19.3.
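
To get a feel for the size of this bound, the following sketch evaluates it numerically. It assumes the bound of Theorem 36.4 reads $\mathrm{BR}_n \le 2 + 2\sqrt{2dn\beta^2\log(1 + nS^2L^2/d)}$ with $\beta = 1 + \sqrt{2\log(n) + d\log(1 + nS^2L^2/d)}$; the function names are hypothetical.

```python
import math

def beta(n, d, S, L):
    """Confidence width beta from Theorem 36.4 (as assumed in the lead-in)."""
    return 1.0 + math.sqrt(2.0 * math.log(n) + d * math.log(1.0 + n * S**2 * L**2 / d))

def bayesian_regret_bound(n, d, S, L):
    """Right-hand side of the Bayesian regret bound for linear Thompson sampling."""
    log_term = math.log(1.0 + n * S**2 * L**2 / d)
    return 2.0 + 2.0 * math.sqrt(2.0 * d * n * beta(n, d, S, L) ** 2 * log_term)
```

Since $|\langle a, \theta\rangle| \le 1$, the regret can never exceed $2n$. With $d = 5$ and $S = L = 1$, the bound is still above this trivial value at $n = 1000$ and only becomes informative for larger horizons, which is typical of bounds with this many logarithmic factors.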


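Proof  We apply the same technique as used in the proof of Theorem 36.1. Define
the upper confidence bound function $U_t : \mathcal A \to \mathbb R$ by

$$U_t(a) = \langle a, \hat\theta_{t-1}\rangle + \beta\|a\|_{V_{t-1}^{-1}}, \quad\text{where}\quad V_t = \frac{1}{S^2} I + \sum_{s=1}^t A_s A_s^\top.$$

By Theorem 20.5 and Eq. (20.9), $\mathbb P(\text{exists } t \le n : \|\hat\theta_{t-1} - \theta\|_{V_{t-1}} > \beta) \le 1/n$. Let
$E_t$ be the event that $\|\hat\theta_{t-1} - \theta\|_{V_{t-1}} \le \beta$, $E = \bigcap_{t=1}^n E_t$ and $A^* = \operatorname{argmax}_{a\in\mathcal A} \langle a, \theta\rangle$.
Note that $A^*$ is a random variable because $\theta$ is random. Then

$$\begin{aligned}
\mathrm{BR}_n &= \mathbb E\left[\sum_{t=1}^n \langle A^* - A_t, \theta\rangle\right] \\
&= \mathbb E\left[\mathbb I_{E^c} \sum_{t=1}^n \langle A^* - A_t, \theta\rangle\right] + \mathbb E\left[\mathbb I_{E} \sum_{t=1}^n \langle A^* - A_t, \theta\rangle\right] \\
&\le 2n\,\mathbb P(E^c) + \mathbb E\left[\sum_{t=1}^n \mathbb I_{E_t} \langle A^* - A_t, \theta\rangle\right] \\
&\le 2 + \mathbb E\left[\sum_{t=1}^n \mathbb I_{E_t} \langle A^* - A_t, \theta\rangle\right]. \qquad (36.7)
\end{aligned}$$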
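Let $\mathcal F_t = \sigma(A_1, X_1, \ldots, A_t, X_t)$ and let $\mathbb E_{t-1}[\cdot] = \mathbb E[\cdot \mid \mathcal F_{t-1}]$. As before,
$\mathbb P(A^* = \cdot \mid \mathcal F_{t-1}) = \mathbb P(A_t = \cdot \mid \mathcal F_{t-1})$, and $U_t(a)$, for any fixed $a \in \mathcal A$, is $\mathcal F_{t-1}$-measurable,
and so $\mathbb E_{t-1}[U_t(A^*)] = \mathbb E_{t-1}[U_t(A_t)]$. It follows that the summands of the second term
in the above display are bounded by

$$\begin{aligned}
\mathbb E_{t-1}\left[\mathbb I_{E_t} \langle A^* - A_t, \theta\rangle\right]
&= \mathbb I_{E_t} \mathbb E_{t-1}\left[\langle A^*, \theta\rangle - U_t(A^*) + U_t(A_t) - \langle A_t, \theta\rangle\right] \\
&\le \mathbb I_{E_t} \mathbb E_{t-1}\left[U_t(A_t) - \langle A_t, \theta\rangle\right] \\
&\le \mathbb I_{E_t} \mathbb E_{t-1}\left[\langle A_t, \hat\theta_{t-1} - \theta\rangle\right] + \beta\|A_t\|_{V_{t-1}^{-1}} \\
&\le \mathbb I_{E_t} \mathbb E_{t-1}\left[\|A_t\|_{V_{t-1}^{-1}} \|\hat\theta_{t-1} - \theta\|_{V_{t-1}}\right] + \beta\|A_t\|_{V_{t-1}^{-1}} \\
&\le 2\beta\|A_t\|_{V_{t-1}^{-1}}.
\end{aligned}$$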


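Substituting this combined with $\mathbb I_{E_t}\langle A^* - A_t, \theta\rangle \le 2$ into the second term of
Eq. (36.7), we get

$$\begin{aligned}
\mathbb E\left[\sum_{t=1}^n \mathbb I_{E_t} \langle A^* - A_t, \theta\rangle\right]
&\le 2\beta\, \mathbb E\left[\sum_{t=1}^n \left(1 \wedge \|A_t\|_{V_{t-1}^{-1}}\right)\right] \\
&\le 2\sqrt{n\beta^2\, \mathbb E\left[\sum_{t=1}^n \left(1 \wedge \|A_t\|^2_{V_{t-1}^{-1}}\right)\right]} && \text{(Cauchy–Schwarz)} \\
&\le 2\sqrt{2dn\beta^2\, \mathbb E\left[\log\left(1 + \frac{nS^2L^2}{d}\right)\right]}. && \text{(Lemma 19.4)}
\end{aligned}$$

Putting together the pieces shows that

$$\mathrm{BR}_n \le 2 + 2\sqrt{2dn\beta^2 \log\left(1 + \frac{nS^2L^2}{d}\right)}.$$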


Computation 


An implementation of Thompson sampling for linear bandits needs to sample $\theta_t$
from the posterior and then find the optimal action for the sampled parameter:

$$A_t = \operatorname{argmax}_{a\in\mathcal A} \langle a, \theta_t\rangle.$$


For some priors and noise models, sampling from the posterior is straightforward. 
The most notable case is when Q is a multivariate Gaussian and the noise is 
Gaussian with a known variance. More generally, there is a large literature devoted 
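to numerical methods for sampling from posterior distributions. Having sampled
$\theta_t$, finding $A_t$ is a linear optimisation problem. By comparison, LinUCB needs to
solve

$$A_t = \operatorname{argmax}_{a\in\mathcal A} \max_{\tilde\theta \in \mathcal C_t} \langle a, \tilde\theta\rangle,$$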


which for large or continuous action sets is often intractable. 
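
The computational gap can be illustrated on the hypercube $\mathcal A = [-1, 1]^d$, an action set chosen here purely as an example. For Thompson sampling the maximiser of $\langle a, \theta_t\rangle$ is found coordinate by coordinate. For LinUCB with an ellipsoidal confidence set, the index $\langle a, \hat\theta\rangle + \beta\|a\|_{V^{-1}}$ is convex in $a$, so its maximum over the cube is attained at a vertex, and the naive approach sketched below enumerates all $2^d$ of them. The function names are hypothetical.

```python
import itertools
import numpy as np

def ts_action_hypercube(theta_t):
    """argmax over [-1, 1]^d of <a, theta_t>: take the sign of each coordinate."""
    return np.where(np.asarray(theta_t) >= 0, 1.0, -1.0)

def ucb_action_bruteforce(theta_hat, V_inv, beta):
    """Maximise <a, theta_hat> + beta * ||a||_{V^{-1}} over the 2^d vertices by enumeration."""
    d = len(theta_hat)
    best_a, best_val = None, -np.inf
    for signs in itertools.product((-1.0, 1.0), repeat=d):
        a = np.array(signs)
        val = a @ theta_hat + beta * np.sqrt(a @ V_inv @ a)
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```

The first function costs $O(d)$ operations while the second costs on the order of $2^d d^2$, which is already impractical for moderate $d$.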


Information Theoretic Analysis 


We now examine a Bayesian version of the adversarial k-armed bandit. As we 
will see, the natural generalisation of Thompson sampling is still a reasonable 
algorithm. Recall that in the adversarial bandit model studied in Part III, the 
adversary secretly chooses a matrix $x \in [0,1]^{n\times k}$ at the start of the game and
the reward in round $t$ is $x_{tA_t}$. In the Bayesian set-up, there is a prior probability
measure $Q$ on $[0,1]^{n\times k}$ with the Borel $\sigma$-algebra. At the start of the game, a
reward matrix $X$ is sampled from $Q$, but not revealed. The learner then chooses
actions $(A_t)_{t=1}^n$ and the reward in round $t$ is $X_{tA_t}$. Formally, let $\pi = (\pi_t)_{t=1}^n$ be a
policy and $X \in [0,1]^{n\times k}$ and $(A_t)_{t=1}^n$ be random elements on some probability
space $(\Omega, \mathcal F, \mathbb P)$ such that:


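(a) the law of $X$ is $\mathbb P_X = Q$; and
(b) $\mathbb P(A_t \in \cdot \mid X, H_{t-1}) = \pi_t(\cdot \mid H_{t-1})$ with $H_t = (A_1, X_{1A_1}, \ldots, A_t, X_{tA_t})$.


The inclusion of $X$ in the conditional expectation in Part (b) implies that

$$\mathbb P(A_t \in \cdot \mid X, H_{t-1}) = \mathbb P(A_t \in \cdot \mid H_{t-1}),$$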


which means that $A_t$ and $X$ are conditionally independent given $H_{t-1}$. This is
consistent with our definition of the model where $X$ is sampled first from $Q$ and
then $A_t$ depends on $X$ only through the history $H_{t-1}$. The optimal action is
$A^* = \operatorname{argmax}_{a\in[k]} \sum_{t=1}^n X_{ta}$ with ties broken arbitrarily. The Bayesian regret is

$$\mathrm{BR}_n = \mathbb E\left[\sum_{t=1}^n \left(X_{tA^*} - X_{tA_t}\right)\right].$$


Like in the previous sections, Thompson sampling is a policy $\pi = (\pi_t)_{t=1}^n$ that
plays each action according to the conditional probability that it is optimal, which
means the following holds almost surely:

$$\pi_t(\cdot \mid A_1, X_{1A_1}, \ldots, A_{t-1}, X_{t-1,A_{t-1}}) = \mathbb P(A^* \in \cdot \mid A_1, X_{1A_1}, \ldots, A_{t-1}, X_{t-1,A_{t-1}}).$$

The main result of this section is the following theorem:


THEOREM 36.5. The Bayesian regret of Thompson sampling for Bayesian k-armed 
adversarial bandits satisfies 


$$\mathrm{BR}_n \le \sqrt{kn\log(k)/2}.$$
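
The policy analysed in this theorem can be sketched directly when the prior is supported on finitely many reward matrices, which is an assumption made here only to keep the posterior computable by enumeration. Sampling a matrix from the posterior and playing its best arm in hindsight plays each arm with the posterior probability that it equals $A^*$. All names below are hypothetical.

```python
import numpy as np

def thompson_adversarial(matrices, prior, true_index, seed=0):
    """Thompson sampling for a Bayesian adversarial bandit with a finitely supported prior.

    matrices: (m, n, k) array of candidate reward matrices in [0, 1];
    prior: (m,) prior probabilities; true_index: candidate actually drawn
    (must have positive prior mass). Returns the realised regret.
    """
    rng = np.random.default_rng(seed)
    matrices = np.asarray(matrices, dtype=float)
    posterior = np.asarray(prior, dtype=float).copy()
    x = matrices[true_index]
    n, k = x.shape
    total = 0.0
    for t in range(n):
        # Sample a candidate matrix from the posterior and play its optimal arm.
        j = rng.choice(len(posterior), p=posterior)
        a = int(np.argmax(matrices[j].sum(axis=0)))
        reward = x[t, a]
        total += reward
        # Bayes update: candidates inconsistent with the observation get zero mass.
        posterior = posterior * (matrices[:, t, a] == reward)
        posterior /= posterior.sum()
    return float(x.sum(axis=0).max() - total)
```

Averaging the returned value over draws of `true_index` from the prior (and over seeds) estimates the Bayesian regret $\mathrm{BR}_n$.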


The proof is done through a generic theorem that is powerful enough to analyse 
a wide range of settings. For stating this result, we need some preparation. Let 
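$\mathcal F_t = \sigma(A_1, X_{1A_1}, \ldots, A_t, X_{tA_t})$ and $\mathbb E_t[\cdot] = \mathbb E[\cdot \mid \mathcal F_t]$ and $\mathbb P_t(\cdot) = \mathbb P(\cdot \mid \mathcal F_t)$. Let
$\Delta_t = X_{tA^*} - X_{tA_t}$ denote the immediate regret of round $t$.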

The promised generic theorem bounds the regret in terms of an ‘information 
ratio’ that depends on the ratio of the squared expected instantaneous regret 
conditioned on the past and a Bregman divergence with respect to some convex 
function F to be chosen later. 


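THEOREM 36.6. Let $F : \mathbb R^k \to \mathbb R \cup \{\infty\}$ be convex, and suppose there exists a
constant $\beta \ge 0$ such that

$$\mathbb E_{t-1}[\Delta_t] \le \sqrt{\beta\, \mathbb E_{t-1}\left[D_F\left(\mathbb P_t(A^* = \cdot), \mathbb P_{t-1}(A^* = \cdot)\right)\right]} \quad \text{a.s.}$$

Then $\mathrm{BR}_n \le \sqrt{n\beta \operatorname{diam}_F(\mathcal P_{k-1})}$.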


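Proof  Let $M_t = \mathbb P_t(A^* = \cdot) \in \mathcal P_{k-1}$. Using the directional derivative definition
of the Bregman divergence combined with Fatou's lemma and convexity of $F$,

$$\begin{aligned}
\mathbb E_{t-1}[D_F(M_t, M_{t-1})]
&= \mathbb E_{t-1}\left[F(M_t) - F(M_{t-1}) - \nabla_{M_t - M_{t-1}} F(M_{t-1})\right] \\
&= \mathbb E_{t-1}\left[\liminf_{h\to 0+} \left(F(M_t) - F(M_{t-1}) - \frac{F((1-h)M_{t-1} + hM_t) - F(M_{t-1})}{h}\right)\right] \\
&\le \liminf_{h\to 0+} \mathbb E_{t-1}\left[F(M_t) - F(M_{t-1}) - \frac{F((1-h)M_{t-1} + hM_t) - F(M_{t-1})}{h}\right] \\
&= \mathbb E_{t-1}[F(M_t)] - F(M_{t-1}) + \liminf_{h\to 0+} \frac{F(M_{t-1}) - \mathbb E_{t-1}[F((1-h)M_{t-1} + hM_t)]}{h} \\
&\le \mathbb E_{t-1}[F(M_t)] - F(M_{t-1}) + \liminf_{h\to 0+} \frac{F(M_{t-1}) - F(\mathbb E_{t-1}[(1-h)M_{t-1} + hM_t])}{h} \\
&= \mathbb E_{t-1}[F(M_t)] - F(M_{t-1}), \qquad (36.8)
\end{aligned}$$

where the first inequality follows from Fatou's lemma and the second from the
convexity of $F$. The last equality is because $\mathbb E_{t-1}[M_t] = M_{t-1}$. Hence,

$$\begin{aligned}
\mathrm{BR}_n = \mathbb E\left[\sum_{t=1}^n \Delta_t\right]
&\le \mathbb E\left[\sum_{t=1}^n \sqrt{\beta\, \mathbb E_{t-1}[D_F(M_t, M_{t-1})]}\right] \\
&\le \sqrt{\beta n\, \mathbb E\left[\sum_{t=1}^n \mathbb E_{t-1}[D_F(M_t, M_{t-1})]\right]}
\le \sqrt{\beta n \operatorname{diam}_F(\mathcal P_{k-1})},
\end{aligned}$$

where the first inequality follows from the assumption in the theorem, the second
by Cauchy–Schwarz, while the third follows by Eq. (36.8), telescoping and the
definition of the diameter.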


It remains to choose $F$ and show that the condition of the previous result can
be met. As you might have guessed, a good choice is the unnormalised negentropy
potential $F(p) = \sum_{a=1}^k \left(p_a \log(p_a) - p_a\right)$. Remember that in this case the resulting
Bregman divergence $D_F(p, q)$ is the relative entropy, $D(p, q)$, between categorical
distributions parameterised by $p$ and $q$, respectively.
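
The identity between the Bregman divergence of the unnormalised negentropy and the relative entropy is easy to check numerically. The sketch below assumes strictly positive probability vectors (so that all logarithms are finite) and uses the gradient $\nabla F(q) = \log(q)$, taken coordinate-wise; the function names are hypothetical.

```python
import numpy as np

def negentropy(p):
    """Unnormalised negentropy F(p) = sum_a (p_a log p_a - p_a)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(p) - p))

def bregman_negentropy(p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q> with grad F(q) = log q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return negentropy(p) - negentropy(q) - float(np.log(q) @ (p - q))

def relative_entropy(p, q):
    """D(p, q) = sum_a p_a log(p_a / q_a)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))
```

Expanding the definition gives $D_F(p, q) = \sum_a p_a \log(p_a/q_a) - \sum_a p_a + \sum_a q_a$, and the last two sums cancel when both vectors lie in the simplex.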


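LEMMA 36.7. If $X_{ti} \in [0,1]$ almost surely for all $t \in [n]$ and $i \in [k]$ and $A_t$ is
chosen by Thompson sampling using any prior, then

$$\mathbb E_{t-1}[\Delta_t] \le \sqrt{\frac{k}{2}\, \mathbb E_{t-1}\left[D\left(\mathbb P_t(A^* \in \cdot), \mathbb P_{t-1}(A^* \in \cdot)\right)\right]}.$$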


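Proof  Given a measure $\mathbb P$, we write $\mathbb P_{X|Y}(\cdot)$ for $\mathbb P(X \in \cdot \mid Y)$. In our application
below, $X$ is a random variable, and hence $\mathbb P(X \in \cdot \mid Y)$ can be chosen to be a
probability measure by Theorem 3.11. When $Y$ is discrete, we write $\mathbb P_{X|Y=y}(\cdot)$
for $\mathbb P(X \in \cdot \mid Y = y)$. The result follows by chaining Pinsker's inequality and
Cauchy–Schwarz:

$$\begin{aligned}
\mathbb E_{t-1}[\Delta_t]
&= \sum_{a=1}^k \mathbb P_{t-1}(A_t = a)\left(\mathbb E_{t-1}[X_{ta} \mid A^* = a] - \mathbb E_{t-1}[X_{ta}]\right) \\
&\le \sum_{a=1}^k \mathbb P_{t-1}(A_t = a) \sqrt{\frac{1}{2} D\left(\mathbb P_{t-1, X_{ta} | A^* = a}, \mathbb P_{t-1, X_{ta}}\right)} \\
&\le \sqrt{\frac{k}{2} \sum_{a=1}^k \mathbb P_{t-1}(A_t = a)^2\, D\left(\mathbb P_{t-1, X_{ta} | A^* = a}, \mathbb P_{t-1, X_{ta}}\right)} \\
&\le \sqrt{\frac{k}{2} \sum_{a=1}^k \mathbb P_{t-1}(A_t = a) \sum_{b=1}^k \mathbb P_{t-1}(A^* = b)\, D\left(\mathbb P_{t-1, X_{ta} | A^* = b}, \mathbb P_{t-1, X_{ta}}\right)} \\
&= \sqrt{\frac{k}{2}\, \mathbb E_{t-1}\left[D\left(\mathbb P_t(A^* \in \cdot), \mathbb P_{t-1}(A^* \in \cdot)\right)\right]},
\end{aligned}$$

where the final equality follows from Bayes' law and is left as an exercise.


Proof of Theorem 36.5  The result follows by combining Lemma 36.7, Theorem 36.6
and the fact that the diameter of the unnormalised negentropy potential
is $\operatorname{diam}_F(\mathcal P_{k-1}) = \log(k)$.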


The reason for the name 'information-theoretic' is that historically
Theorem 36.6 was specialised to the unnormalised negentropy, in which case the
expected Bregman divergence is called the information gain or mutual
information. In this sense, Theorem 36.6 shows that the Bayesian regret is
well controlled if $\mathbb E_{t-1}[\Delta_t]$ can be bounded in terms of the information gain
about the optimal action, which seems rather natural. Other potentials can
be useful, however, as you will show in Exercise 36.10.


Notes 


1 There are several equivalent ways to view Thompson sampling in stationary
stochastic multi-armed bandits: (a) select an arm according to the posterior
probability that the arm is optimal, or (b) sample an environment from the 
posterior and play the optimal action in that environment. When the mean 
rewards for each arm are independent under the posterior, then also equivalent 
is (c) sample the mean reward for each arm and choose the arm with the largest 
mean (Exercise 36.1). The algorithms in this chapter are based on (b), but all 
are equivalent and simply correspond to sampling from different push-forward 
measures of the posterior. Historically it seems that Thompson [1933] had 
the form in (a) in mind, but there are reasons to remember the alternative 
views. Though we are not aware of an example, in some instances beyond 
finite-armed bandits, it might be more computationally efficient to sample 
from a push-forward of the posterior than the posterior itself. Furthermore, in 
more complicated situations like reinforcement learning, it may be desirable 
to ‘approximate’ Thompson sampling, and approximating a sample from each 
of the above three choices may lead to different algorithms. It is also good to 
keep in mind that in the non-Bayesian setting there can be cheaper ways of 
inducing sufficient exploration than sampling from a posterior, especially in 
the context of structured bandit problems. 

2 Thompson sampling is known to be asymptotically optimal in a variety of
settings — most notably, when the noise model follows a single-parameter 
exponential family and the prior is chosen appropriately [Kaufmann et al., 
2012b, Korda et al., 2013]. Unfortunately, Thompson sampling is not a silver 
bullet. The linear variant in Section 36.3 is not asymptotically optimal by the 
same argument we presented for optimism in Chapter 25. Characterising the 
conditions under which Thompson sampling is close to optimal remains an 
open challenge. 
3 For the Gaussian noise model, it is known that Thompson sampling is not
minimax optimal. Its worst-case regret is $R_n = \Theta(\sqrt{nk\log(k)})$ [Agrawal and
Goyal, 2013a].

4 An alternative to sampling from the posterior is to choose in each round 
the arm that maximises a Bayesian upper confidence bound, which is a 
quantile of the posterior. The resulting algorithm is called BayesUCB and 
has excellent empirical and theoretical guarantees [Kaufmann et al., 2012a, 
Kaufmann, 2018]. 

5 The prior has a significant effect on the performance of Thompson sampling. 
In classical Bayesian statistics, a poorly chosen prior is quickly washed away by 
data. This is not true in (stochastic, non-Bayesian) bandits because if the prior 
underestimates the quality of an arm, then Thompson sampling may never 
play that arm with high probability and no data is ever observed. We ask you 
to explore this situation in Exercise 36.16. 

6 An instantiation of Thompson sampling for stochastic contextual linear bandits
is known to enjoy near-optimal frequentist regret. In each round the algorithm
samples $\theta_t \sim \mathcal N(\hat\theta_{t-1}, rV_{t-1}^{-1})$, where $r = \Theta(d)$ is a constant and

$$V_t = I + \sum_{s=1}^t A_s A_s^\top \quad\text{and}\quad \hat\theta_t = V_t^{-1} \sum_{s=1}^t X_s A_s.$$

Then $A_t = \operatorname{argmax}_{a\in\mathcal A_t} \langle \theta_t, a\rangle$. This corresponds to assuming the noise is
Gaussian with variance $r$ and choosing prior $Q = \mathcal N(0, I)$. Provided the
rewards are conditionally 1-subgaussian, the frequentist regret of this algorithm
is $R_n = \tilde O(d^{3/2}\sqrt{n})$, which is worse than LinUCB by a factor of $\sqrt{d}$. The
increased regret is caused by the choice of noise model, which assumes the
variance is $r = \Theta(d)$ rather than $r = 1$. The reason to do this comes from the
analysis, which works by showing the algorithm is 'optimistic' with reasonable
probability. Very recently, an example was constructed showing that the blowup
of the variance is necessary. The most recent version of this result establishes
this for an action set with three (fixed) actions [Hamidi and Bayati, 2020].
Empirically, $r = 1$ often leads to strong performance on many instances, though
clearly, as shown by the results of Hamidi and Bayati [2020], this depends on
what instances are used.

7 The analysis in Section 36.4 can be generalised to structured settings such as 
linear bandits [Russo and Van Roy, 2016]. For linear bandits with an infinite 
action set, the entropy of the optimal action may be infinite. The analysis 
can be corrected in this case by discretising the action set and comparing to 
a near-optimal action. This leads to a trade-off between the fineness of the 
discretisation and its size, and when the trade-off is resolved in an optimal 
fashion, one obtains an upper bound of order $O(d\sqrt{n\log(1 + n/d)})$ on the
Bayesian regret, slightly improving previous analysis. The reader is referred to 
the recent article by Dong and Van Roy [2018] for this analysis. 

8 The information-theoretic ideas in Section 36.4 suggest that rather than
sampling $A_t$ from the posterior on $A^*$, one can sample $A_t$ from the distribution
$P_t$ given by

$$P_t = \operatorname{argmin}_{p \in \mathcal P_{k-1}} \frac{\left(\sum_{a=1}^k p_a \left(\mathbb E_{t-1}[X_{ta} \mid A^* = a] - \mathbb E_{t-1}[X_{ta}]\right)\right)^2}{\sum_{a=1}^k p_a\, \mathbb E_{t-1}\left[D_F\left(\mathbb P_{t-1}(A^* = \cdot \mid X_{ta}), \mathbb P_{t-1}(A^* = \cdot)\right)\right]}.$$

When $F$ is the unnormalised negentropy, the resulting policy is called
information-directed sampling. Bayesian regret analysis for this algorithm
follows along similar lines to what was presented in Section 36.4. See
Exercise 36.9 or the paper by Russo and Van Roy [2014a] for more details.

9 The proof of Theorem 36.6 only used the fact that $M_t = \mathbb P_t(A^* = \cdot)$ is a
martingale. The posterior is just one possible choice, but in some cases an
alternative martingale leads to improved bounds.

10 Replacing the unnormalised negentropy potential with $F(p) = -2\sum_{i=1}^k \sqrt{p_i}$
leads to a bound of $\mathrm{BR}_n \le \sqrt{2nk}$ for any prior for finite-armed bandits
[Lattimore and Szepesvari, 2019c]. You will prove this in Exercise 36.10.
The same potential also led to minimax bounds for adversarial bandits in
Exercise 28.15, which suggests there is some kind of connection. This was
explored by Zimmert and Lattimore [2019], who show that the same techniques
used to bound the dual norm 'stability' terms in the analysis of mirror descent
also control the information ratio for a version of Thompson sampling.

11 Let $\mathcal E = [0,1]^{n\times k}$ be the set of all adversarial bandits and $\Pi$ the set of all
randomised policies and $\mathcal Q$ be the set of all finitely supported distributions on
$\mathcal E$, which means that $Q \in \mathcal Q$ is a function $Q : \mathcal E \to [0,1]$ with $\operatorname{Supp}(Q) = \{x : Q(x) > 0\}$
a finite set and $\sum_{x\in\operatorname{Supp}(Q)} Q(x) = 1$. Given $x \in \mathcal E$ and $\pi \in \Pi$, let
$\mathbb E_{x\pi}$ be the expectation with respect to the interaction between policy $\pi$ and
environment $x$. Then,

$$\begin{aligned}
R_n^*(\mathcal E)
&= \min_{\pi\in\Pi} \sup_{x\in\mathcal E} \mathbb E_{x\pi}\left[\max_{i\in[k]} \sum_{t=1}^n \left(x_{ti} - x_{tA_t}\right)\right] && \text{(adversarial regret)} \\
&= \sup_{Q\in\mathcal Q} \min_{\pi\in\Pi} \sum_{x\in\operatorname{Supp}(Q)} Q(x)\, \mathbb E_{x\pi}\left[\max_{i\in[k]} \sum_{t=1}^n \left(x_{ti} - x_{tA_t}\right)\right] && \text{(Bayesian optimal regret)} \qquad (36.9) \\
&\le \sqrt{kn\log(k)/2},
\end{aligned}$$

where the second equality follows from Sion's minimax theorem (Exercise 36.11)
and the inequality follows from Theorem 36.5. This bound is a factor of two
better than what we gave in Theorem 11.2 and can be improved to $\sqrt{2nk}$ using
the argument from the previous note and Exercise 36.10. The approach has
been used in more sophisticated settings, like the first near-optimal analysis
for adversarial convex bandits [Bubeck et al., 2015a, Bubeck and Eldan, 2016]
or partial monitoring [Lattimore and Szepesvari, 2019c]. As noted earlier, the
main disadvantage is that the technique does not lead to algorithms for the
adversarial setting.


Bibliographic Remarks 


Thompson sampling has the honor of being the first bandit algorithm and 
is named after its inventor [Thompson, 1933], who considered the Bernoulli 
case with two arms. Thompson provided no theoretical guarantees, but argued 
intuitively and gave hand-calculated empirical analysis. It would be wrong to 
say that Thompson sampling was entirely ignored for the next eight decades, 
but it was definitely not popular until recently, when a large number of authors 
independently rediscovered the article/algorithm [Graepel et al., 2010, Granmo, 
2010, Ortega and Braun, 2010, Chapelle and Li, 2011, May et al., 2012]. The 
surge in interest was mostly empirical, but theoreticians followed soon with regret 
guarantees. For the frequentist analysis, we followed the proofs by Agrawal and 
Goyal [2012, 2013a], but the setting is slightly different. We presented results for 
the ‘realisable’ case where the pay-off distributions are actually Gaussian, while 
Agrawal and Goyal use the same algorithm but prove bounds for rewards bounded 
in [0,1]. Agrawal and Goyal [2013a] also analyse the Beta/Bernoulli variant of 
Thompson sampling, which for rewards in [0,1] is asymptotically optimal in 
the same way as KL-UCB (see Chapter 10). This result was simultaneously 
obtained by Kaufmann et al. [2012b], who later showed that for appropriate 
priors, asymptotic optimality also holds for single-parameter exponential families 
[Korda et al., 2013]. For Gaussian bandits with unknown mean and variance, 
Thompson sampling is asymptotically optimal for some priors, but not others — 
even quite natural ones [Honda and Takemura, 2014]. The Bayesian analysis of 
Thompson sampling based on confidence intervals is due to Russo and Van Roy 
[2014b]. Recently the idea has been applied to a wide range of bandit settings 
[Kawale et al., 2015, Agrawal et al., 2017] and reinforcement learning [Osband 
et al., 2013, Gopalan and Mannor, 2015, Leike et al., 2016, Kim, 2017]. The 
BayesUCB algorithm is due to Kaufmann et al. [2012a], with improved analysis 
and results by Kaufmann [2018]. The frequentist analysis of Thompson sampling 
for linear bandits is by Agrawal and Goyal [2013b], with refined analysis by 
Abeille and Lazaric [2017a] and a spectral version by Kocák et al. [2014]. A recent 
paper analyses the combinatorial semi-bandit setting [Wang and Chen, 2018]. 
The information-theoretic analysis is by Russo and Van Roy [2014a, 2016], while 
the generalisation beyond the negentropy potential is by Lattimore and Szepesvari
[2019c]. As we mentioned, these ideas have been applied to convex bandits [Bubeck 
et al., 2015a, Bubeck and Eldan, 2016] and also to partial monitoring [Lattimore 
and Szepesvari, 2019c]. There is a tutorial on Thompson sampling by Russo 
et al. [2018] that focuses mostly on applications and computational issues. We 
mentioned there are other ways to configure Algorithm 24, for example the recent 
article by Kveton et al. [2019]. 
Exercises


36.1 (EQUIVALENT VIEWS) Prove the claimed equivalences in Note 1. 


36.2 (FILLING IN STEPS IN THE PROOF OF THEOREM 36.1 (I)) Consider the
event $E$ defined in Theorem 36.1, and prove that $\mathbb P(E^c) \le 2nk\delta$.


36.3 (FILLING IN STEPS IN THE PROOF OF THEOREM 36.1 (II)) Prove Eq. (36.1).


36.4 (REMOVING LOGARITHMIC FACTORS) Improve the bound in Theorem 36.1
to show that $\mathrm{BR}_n \le C\sqrt{kn}$, where $C > 0$ is a universal constant.


HINT Replace the naive confidence intervals used in the proof of Theorem 36.1 
by the more refined confidence bounds used in Chapter 9. The source for this 
result is the paper by Bubeck and Liu [2013]. 


36.5 (FILLING IN STEPS IN THE PROOF OF THEOREM 36.2) Let G(s) = 
1— F;s (m = £). Show that 


(a) Ņ\ HA = asd {G,(s—1) > 1/n}; and 
teT 
(b) | Suter) <E|S01/n}. 
t¢T tT 


36.6 (FREQUENTIST BOUND FOR THOMPSON SAMPLING) In this exercise you 
will prove Theorem 36.3. 


(a) Show that there exists a universal constant c > 0 such that 
= 1 c 1 
J ——~ -1]| < —1 = hes 
Sl) <o%(2) 


eee ) 
<G a 


(c) Use Theorem 36.2 and the fundamental regret decomposition (Lemma 4.5) 
to prove Theorem 36.3. 


(b) Show that 


+ o(log(n)) . 


[ue > 1/n} 


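HINT For (a) you may find it useful to know that for $y \ge 0$,

$$1 - \Phi(y) \ge \sqrt{\frac{2}{\pi}}\, \frac{\exp(-y^2/2)}{y + \sqrt{y^2 + 4}},$$

where $\Phi(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} \exp(-x^2/2)\,dx$ is the cumulative distribution function of
the standard Gaussian [Abramowitz and Stegun, 1964, §7.1.13].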


36.7 Prove the final equality in the proof of Lemma 36.7. 
36.8 (PREDICTION WITH EXPERT ADVICE) Consider the adversarial Bayesian
framework from Section 36.4, but assume the learner observes the whole vector
$X_t$ rather than just $X_{tA_t}$, which corresponds to the prediction with expert advice
setting. Prove that Thompson sampling in this setting has a Bayesian regret of
at most

$$\mathrm{BR}_n \le \sqrt{n\log(k)/2}.$$


36.9 (INFORMATION-DIRECTED SAMPLING) Prove that for any prior such that
$X_{ti} \in [0,1]$ almost surely, the Bayesian regret of information-directed sampling
(see Note 8) satisfies

$$\mathrm{BR}_n \le \sqrt{kn\log(k)/2}.$$


36.10 (MINIMAX BAYESIAN REGRET FOR THOMPSON SAMPLING) Prove that for
any prior over adversarial k-armed bandits such that $X_{ti} \in [0,1]$ almost surely,
the Bayesian regret of Thompson sampling satisfies $\mathrm{BR}_n \le \sqrt{2kn}$.


HINT Use the potential $F(p) = -2\sum_{i=1}^k \sqrt{p_i}$ and the fact that the total
variation distance is upper-bounded by the Hellinger distance.


36.11 (FROM BAYESIAN TO ADVERSARIAL REGRET) Let $\mathcal E = \{0,1\}^{n\times k}$ and $\mathcal Q$
be the space of probability measures on $\mathcal E$. Prove that

$$R_n^*(\mathcal E) = \sup_{Q\in\mathcal Q} \mathrm{BR}_n^*(Q).$$

HINT Repeat the argument in the solution to Exercise 34.16, noting that $\mathcal Q$ is
finite dimensional. Take care to adapt the result in Exercise 4.5 to the adversarial
setting.


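36.12 (FROM BAYESIAN TO ADVERSARIAL REGRET) Let $\mathcal E = [0,1]^{n\times k}$. Prove
that

$$R_n^*(\mathcal E) = \sup_{Q\in\mathcal Q} \mathrm{BR}_n^*(Q),$$

where $\mathcal Q$ is the set of probability measures on $(\mathcal E, \mathfrak B(\mathcal E))$ with finite support.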


HINT That $\mathcal E$ is uncountably large introduces some challenges. Like in the
previous exercise, the idea is to express the regret of a policy as an integral
over the regret of deterministic policies, which can be viewed as functions
$\pi : \bigcup_{t=1}^n [0,1]^{t-1} \to [k]$. Use Tychonoff's theorem to argue that the space of
all deterministic policies is compact with respect to the product topology. Then 
the space of regular probability measures over deterministic policies is compact 
with the weak* topology by Theorem 2.14. Then carefully check continuity 
and linearity of the Bayesian regret, and apply Sion’s theorem. Details are by 
Lattimore and Szepesvari [2019c]. 


36.13 (BINARY IS THE WORST CASE) Prove that $R_n^*(\{0,1\}^{n\times k}) = R_n^*([0,1]^{n\times k})$.

HINT Think about how to use a minimax optimal policy for $\{0,1\}^{n\times k}$ for
bandits in $[0,1]^{n\times k}$.


36.14 (IMPLEMENTATION (I)) In this exercise, you will reproduce the results in
Experiment 36.1.


(a) Implement Thompson sampling as described in Theorem 36.3 as well as 
UCB and AdaUCB. 

(b) Reproduce the figures in Experiment 36.1 and add the corresponding results for UCB.

(c) How consistent are these results across different bandits? Run a few
experiments and report the results.

(d) Explain your findings. Which algorithm do you prefer and why?


36.15 (IMPLEMENTATION (II)) Implement linear Thompson sampling with a
Gaussian prior as defined in Note 6 as well as LinUCB from Chapter 19 and
Algorithm 12. Compare these algorithms in a variety of regimes, report your
results, and tell an interesting story. Discuss the pros and cons of different choices
of $r$.


36.16 (MISSPECIFIED PRIOR) Fix a Gaussian bandit with unit variance and
mean vector $\mu = (0, 1/10)$ and horizon $n = 1000$. Now consider Thompson
sampling with a Gaussian model with known unit covariance and a prior on the
unknown mean of each arm given by a Gaussian distribution with mean $\mu_P$ and
covariance $\sigma_P^2 I$.


(a) Let the prior mean be $\mu_P = (0, 0)$, and plot the regret of Thompson sampling
as a function of the prior variance $\sigma_P^2$.

(b) Repeat the above with $\mu_P = (0, 1/10)$ and $(0, -1/10)$ and $(2/10, 1/10)$.

(c) Explain your results. 


Part VIII 
Beyond Bandits 
Partial Monitoring


While in a bandit problem, the feedback that the learner receives from the 
environment is the loss of the chosen action, in partial monitoring the coupling 
between the loss of the action and the feedback received by the learner is loosened. 

Consider the problem of learning to match pennies when feedback is costly. Let
$c > 0$ be a known constant. At the start of the game, the adversary secretly
chooses a sequence $i_1, \ldots, i_n \in$ {HEADS, TAILS}. In each round, the learner chooses
an action $A_t \in$ {HEADS, TAILS, UNCERTAIN}. The loss for choosing action $a$ in round $t$ is

$$y_{ta} = \begin{cases} 0, & \text{if } a = i_t; \\ c, & \text{if } a = \text{UNCERTAIN}; \\ 1, & \text{otherwise.} \end{cases}$$

Figure 37.1 Spam filtering is a potential application of partial monitoring. The turtle
(called Spam) was inherited by one of the authors.

So far this looks like a bandit problem. The difference is that the learner
never directly observes $y_{tA_t}$. Instead, the learner observes nothing unless
$A_t$ = UNCERTAIN, in which case they observe the value of $i_t$. As usual, the
goal is to minimise the (expected) regret, which is

$$R_n = \max_{a} \mathbb E\left[\sum_{t=1}^n \left(y_{tA_t} - y_{ta}\right)\right],$$

where the maximum is over the three actions. How should a learner act in problems
like this, where the loss is not directly observed? Can we find a policy with
sublinear regret? In this chapter we give a more or less complete answer to these
questions for finite adversarial partial monitoring games, which include the above
problem as a special case.


Matching pennies with costly feedback seems like an esoteric problem. But
think about adding contextual information and replacing the pennies with
emails to be classified as spam or otherwise. The true label is only accessible 
by asking a human, which replaces the third action. While the chapter does 
not cover the contextual version, some pointers to the literature are added 
at the end. 



Finite Adversarial Partial Monitoring Problems 


A finite, k-action, d-outcome adversarial partial monitoring problem is
specified by a loss matrix $\mathcal L \in \mathbb R^{k\times d}$ and a feedback matrix $\Phi \in \Sigma^{k\times d}$, where
$\Sigma$ is called the set of signals. We let $m$ be the maximum number of distinct
symbols in any row of $\Phi$. At the beginning of the game, the learner is given $\mathcal L$ and
$\Phi$, and the environment secretly chooses $n$ outcomes $i_1, \ldots, i_n$ with $i_t \in [d]$. The
loss of action $a \in [k]$ in round $t$ is $y_{ta} = \mathcal L_{a i_t}$. In each round $t$, the learner chooses
$A_t \in [k]$ and receives feedback $\sigma_t = \Phi_{A_t i_t}$. Given a partial monitoring problem
$G = (\mathcal L, \Phi)$, the regret of policy $\pi$ when the adversary chooses $i_{1:n} = (i_t)_{t=1}^n$ is

$$R_n(\pi, i_{1:n}, G) = \max_{a\in[k]} \mathbb E\left[\sum_{t=1}^n \left(y_{tA_t} - y_{ta}\right)\right].$$

We omit the arguments of $R_n$ when they can be inferred from the context.
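
The protocol above can be written down in a few lines. The sketch below is only an illustration under stated assumptions: outcomes and actions are 0-based indices, the policy is a hypothetical deterministic function of the list of past (action, signal) pairs, and the returned value is the realised regret of a single run rather than an expectation.

```python
import numpy as np

def play_partial_monitoring(loss, feedback, outcomes, policy):
    """Simulate one run of a finite partial monitoring game.

    loss: (k, d) array (the loss matrix); feedback: k rows of d signals (the
    feedback matrix); outcomes: sequence of outcome indices chosen by the
    adversary; policy: maps the history of (action, signal) pairs to an action.
    """
    loss = np.asarray(loss, dtype=float)
    history = []
    incurred = 0.0
    for i_t in outcomes:
        a_t = policy(history)
        incurred += loss[a_t, i_t]                     # suffered, but never shown to the learner
        history.append((a_t, feedback[a_t][i_t]))      # only the signal is observed
    best_fixed = loss[:, list(outcomes)].sum(axis=1).min()
    return float(incurred - best_fixed)
```

For example, in matching pennies with $c = 1/4$ (using "-" as a placeholder for the uninformative signal), a learner that always plays TAILS against the outcome sequence H, H, T, H suffers loss 3 while the best fixed action suffers loss 1.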


To reduce clutter, we slightly abuse notation by using (e;) to denote the 
standard basis vectors of Euclidean spaces of potentially different dimensions. 


Examples 


The partial monitoring framework is rich enough to model a wide variety of 
problems, a few of which are illustrated in the examples that follow. Many of the 
examples are quite artificial and are included only to highlight the flexibility of 
the framework and challenges of making the regret small. 


EXAMPLE 37.1 (Hopeless problem). Some partial monitoring problems are 
completely hopeless in the sense that one cannot expect to make the regret 
small. A simple example occurs when k = d = 2, m = 1 and 


L= , = (37.1) 


Note that rows/columns correspond to choices of the learner/adversary, 
respectively. In both rows, the feedback matrix has identical entries for both 
columns. As the learner has no way of distinguishing between different sequences 
of outcomes, there is no way to learn and avoid linear regret. The reader is 
encouraged to think of generalisations of this example where the game is still 
hopeless. 
Two feedback matrices $\Phi \in \Sigma^{k\times d}$ and $\Phi' \in \Sigma'^{k\times d}$ encode the same information
if the pattern of identical entries in each row matches. For example,


i A > A o 
=|1 2 2 and =o Q Q 
giil & vv 


both encode the same information. Note that for these matrices m = 2 since 
in any row there are at most two distinct symbols. 


EXAMPLE 37.2 (Trivial problem). Just as there are hopeless problems, there are 
also trivial problems. This happens when one action dominates all others as in 
the following problem: 


L= y = 
1 1 


In this game the learner can safely ignore the second action and suffer zero regret, 
regardless of the choices of the adversary. 


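EXAMPLE 37.3 (Matching pennies). The penny-matching problem mentioned in
the introduction has $k = 3$ actions and $d = 2$ outcomes and is described by

$$\mathcal L = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ c & c \end{pmatrix}, \qquad \Phi = \begin{pmatrix} \bot & \bot \\ \bot & \bot \\ \mathrm{H} & \mathrm{T} \end{pmatrix}. \qquad (37.2)$$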


Matching pennies is a hard game for $c > 1/2$ in the sense that the adversary can
force the regret of any policy to be at least $\Omega(n^{2/3})$. To see this, consider the
randomised adversary that chooses the first outcome with probability $p$ and the
second with probability $1 - p$. Let $\varepsilon > 0$ be a small constant to be chosen later
and assume $p$ is either $1/2 + \varepsilon$ or $1/2 - \varepsilon$. The techniques in Chapter 13 show that
the learner can only distinguish between these environments by playing the third
action about $1/\varepsilon^2$ times. If the learner does not choose to do this, then the regret
is expected to be $\Omega(n\varepsilon)$. Taking these together shows the regret is lower-bounded
by $R_n = \Omega(\min(n\varepsilon, (c - 1/2 + \varepsilon)/\varepsilon^2))$. Choosing $\varepsilon = n^{-1/3}$ leads to a bound
of $R_n = \Omega((c - 1/2)n^{2/3})$. Notice that the argument fails when $c \le 1/2$. We
encourage you to pause for a minute to convince yourself about the correctness of
the above argument and to consider what might be the situation when $c \le 1/2$.
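
The balancing step in this argument can be checked numerically. The sketch below simply evaluates the two terms inside the minimum at $\varepsilon = n^{-1/3}$, ignoring the constants hidden by the $\Omega(\cdot)$ notation; the function name is hypothetical.

```python
def matching_pennies_lower_bound_terms(n, c):
    """Evaluate n*eps and (c - 1/2 + eps)/eps^2 at eps = n^(-1/3)."""
    eps = n ** (-1.0 / 3.0)
    regret_if_not_exploring = n * eps                    # equals n^(2/3)
    cost_of_exploring = (c - 0.5 + eps) / eps**2         # equals (c - 1/2) n^(2/3) + n^(1/3)
    return regret_if_not_exploring, cost_of_exploring
```

With this choice of $\varepsilon$ both terms are of order $n^{2/3}$ whenever $c > 1/2$, and the second is at least $(c - 1/2)n^{2/3}$, which is where the stated bound comes from.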


EXAMPLE 37.4 (Bandits). Finite-armed adversarial bandits with binary losses
can be represented in the partial monitoring framework. When $k = 2$, this is
possible with the following matrices:

$$\mathcal L = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}, \qquad \Phi = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}.$$

The number of columns for this game is $2^k$. For non-binary rewards, you would
need even more columns. A partial monitoring problem where $\Phi = \mathcal L$ can be
called a bandit problem because the learner observes the loss of the chosen action.
In bandit games, Exp3 from Chapter 11 guarantees a regret of $O(\sqrt{kn\log(k)})$,
and as noted there, a more sophisticated algorithm will also remove the $\log(k)$
factor. If you completed Exercise 15.4 then you will know that, up to a constant
factor, $\sqrt{kn}$ is also the best possible regret in adversarial bandits with binary
losses.
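
The construction generalises to any number of arms: each outcome is one of the $2^k$ binary loss vectors, and the feedback matrix is a copy of the loss matrix. A minimal sketch, with a hypothetical function name and an arbitrary ordering of the columns, is:

```python
import itertools
import numpy as np

def binary_bandit_game(k):
    """Return (loss, feedback) for a k-armed bandit with binary losses."""
    # One column per loss vector in {0, 1}^k; entry (a, i) is the loss of arm a under outcome i.
    columns = list(itertools.product((0, 1), repeat=k))
    loss = np.array(columns).T          # shape (k, 2^k)
    feedback = loss.copy()              # bandit feedback: the signal is the loss itself
    return loss, feedback
```

The exponential number of columns is harmless for the theory but shows why a generic partial monitoring algorithm is not a practical replacement for a dedicated bandit algorithm.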


EXAMPLE 37.5 (Full information problems). One can also represent problems
where the learner observes all the losses. With binary losses and two actions, we
have

$$\mathcal L = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}, \qquad \Phi = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 \end{pmatrix}.$$

Like for bandits, the size of the game grows quickly as more actions/outcomes
are added. A partial monitoring game where $\Phi_{ai} = i$ for all $a \in [k]$ and $i \in [d]$
can be called full information because the signal reveals the losses for all actions.


EXAMPLE 37.6 (Dynamic pricing). A charity worker is going door to door selling
calendars. The marginal cost of a calendar is close to zero, but the wages of the
door knocker represent a fixed cost of $c > 0$ per occupied house. The question
is how to price the calendar. Each round corresponds to an attempt to sell a
calendar, and the action is the seller's asking price from one of $d$ choices. The
potential buyer will purchase the calendar if the asking price is low enough. Below
we give the corresponding matrices for the case where both the candidate asking prices
and the possible values for the buyer's private valuations are $\{\$1, \$2, \$3, \$4\}$:

\[
\mathcal{L} = \begin{pmatrix}
c-1 & c-1 & c-1 & c-1 \\
c & c-2 & c-2 & c-2 \\
c & c & c-3 & c-3 \\
c & c & c & c-4
\end{pmatrix}, \qquad
\Phi = \begin{pmatrix}
Y & Y & Y & Y \\
N & Y & Y & Y \\
N & N & Y & Y \\
N & N & N & Y
\end{pmatrix}.
\]

Notice that observing the feedback is sufficient to deduce the loss so the problem
could be tackled with a bandit algorithm. But there is additional structure in
the losses here because the learner knows that if a calendar did not sell for \$3,
then it would not sell for \$4.
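A minimal sketch of how these two matrices could be generated for arbitrary price and valuation grids. The function name `dynamic_pricing_game` and its arguments are hypothetical; the only rule used is the one stated in the example, namely that a sale happens exactly when the asking price does not exceed the buyer's valuation.

```python
def dynamic_pricing_game(prices, valuations, c):
    """Build (L, Phi) for the dynamic pricing example.

    Hypothetical helper: rows are asking prices, columns are valuations.
    The loss is c - price when the calendar sells and c otherwise.
    """
    L, Phi = [], []
    for price in prices:
        sold = [price <= v for v in valuations]
        L.append([c - price if s else c for s in sold])
        Phi.append(["Y" if s else "N" for s in sold])
    return L, Phi


L, Phi = dynamic_pricing_game([1, 2, 3, 4], [1, 2, 3, 4], c=1)
for loss_row, feedback_row in zip(L, Phi):
    print(loss_row, feedback_row)
```

The monotone structure noted above shows up as the staircase of `N` entries below the diagonal of `Phi`.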
The Structure of Partial Monitoring


The minimax regret of partial monitoring problem $G = (\mathcal{L}, \Phi)$ is

\[
R_n^*(G) = \inf_{\pi} \max_{i_{1:n}} R_n(\pi, i_{1:n}, G).
\]

One of the core questions in partial monitoring is to understand the growth of
$R_n^*(G)$ as a function of $n$ for different games. We have seen examples where

\[
\begin{aligned}
R_n^*(G) &= 0 && \text{(Example 37.2)} \\
R_n^*(G) &= \Theta(n^{1/2}) && \text{(Example 37.4)} \\
R_n^*(G) &= \Theta(n^{2/3}) && \text{(Example 37.3)} \\
R_n^*(G) &= \Omega(n). && \text{(Example 37.1)}
\end{aligned}
\]

The main result of this chapter is that there are no other options. A partial
monitoring game is called trivial if $R_n^*(G) = 0$, easy if $R_n^*(G) = \Theta(n^{1/2})$, hard
if $R_n^*(G) = \Theta(n^{2/3})$ and hopeless if $R_n^*(G) = \Omega(n)$. Furthermore, we will show
that any game can be classified using elementary linear algebra.

What makes matching pennies hard and bandits easy? To get a handle on this, 
we need a geometric representation of partial monitoring games. The next few 
paragraphs introduce a lot of new terminology that can be hard to grasp all at 
once. At the end of the section, there is an example illustrating the concepts 
(Example 37.10). 


The Geometry of Losses and Actions 


The geometry underlying partial monitoring comes from viewing the problem as
a linear prediction problem, where the adversary plays on the $(d-1)$-dimensional
probability simplex and the learner plays on the rows of $\mathcal{L}$. Define a sequence of
vectors $(u_t)_{t=1}^n$ by $u_t = e_{i_t}$ and let $\ell_a \in \mathbb{R}^d$ be the $a$th row of matrix $\mathcal{L}$. The loss
suffered in round $t$ when choosing action $a$ is $y_{ta} = \langle \ell_a, u_t \rangle$.

Let $\bar u_t = \frac{1}{t} \sum_{s=1}^t u_s \in \mathcal{P}_{d-1}$ be the probability vector of proportions of
the adversary's choices over $t$ rounds. An action $a$ is optimal in hindsight if
$\langle \ell_a, \bar u_n \rangle \le \min_{b \ne a} \langle \ell_b, \bar u_n \rangle$. The cell of an action $a$ is the subset of $\mathcal{P}_{d-1}$ on which
it is optimal:

\[
C_a = \left\{ u \in \mathcal{P}_{d-1} : \max_{b \in [k]} \langle \ell_a - \ell_b, u \rangle \le 0 \right\},
\]

which is a convex polytope. The collection $\{C_a : a \in [k]\}$ is called the cell
decomposition of $\mathcal{P}_{d-1}$. Actions with $C_a = \emptyset$ are called dominated because
they are never optimal, no matter how the adversary plays. For non-dominated
actions we define the dimension of an action to be the dimension of the affine
hull of $C_a$. Readers unfamiliar with the affine hull should read Note 4 at the
end of the chapter. A non-dominated action is called Pareto optimal if it has
dimension $d - 1$, and degenerate otherwise. Actions $a$ and $b$ are duplicates if
$\ell_a = \ell_b$. The set of all Pareto optimal actions is denoted by $\Pi \subseteq [k]$. A partial
monitoring game is called degenerate if it has any degenerate or duplicate actions.


Dominated and degenerate actions can never be uniquely optimal in hindsight, 
but their presence can make the difference between a hard game and a 
hopeless one. Consider the matching pennies game (Example 37.3). When 
c > 1/2, the third action is dominated, but without it the learner would 
suffer linear regret. Duplicate actions are only duplicate in the sense that 
they have the same loss. They may have different feedback structures and 
so cannot be trivially combined. 
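As a small illustration of the cell definition, the sketch below lists the actions whose cell contains a given point $u$ of the simplex. The helper name `optimal_actions` is an assumption, actions are indexed from 0, and exact rational arithmetic is used so that ties on cell boundaries are detected reliably. The loss vectors are those of the three-action example with $d = 2$ used in Fig. 37.2.

```python
from fractions import Fraction as F


def optimal_actions(L, u):
    """Return the (0-indexed) actions a with u in C_a.

    u lies in C_a exactly when <l_a, u> <= <l_b, u> for every action b.
    """
    losses = [sum(l * x for l, x in zip(row, u)) for row in L]
    best = min(losses)
    return [a for a, loss in enumerate(losses) if loss == best]


# Loss vectors l_1 = (1, 0), l_2 = (0, 1), l_3 = (1/2, 1/2).
L = [[1, 0], [0, 1], [F(1, 2), F(1, 2)]]

print(optimal_actions(L, (F(1, 4), F(3, 4))))  # interior of C_1
print(optimal_actions(L, (F(1, 2), F(1, 2))))  # the single point shared by all cells
```

The third action is optimal only at $u = (1/2, 1/2)$, which is why it is degenerate rather than Pareto optimal.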


Neighbourhood relation 
Pareto optimal actions $a$ and $b$ are neighbours if $C_a \cap C_b$ has dimension $d - 2$.
Note that if $a$ and $b$ are Pareto optimal duplicates, then $C_a \cap C_b$ has dimension
$d - 1$, and the definition means that $a$ and $b$ are not neighbours. For Pareto
optimal action $a$ we let $\mathcal{N}_a$ be the set consisting of $a$ and its neighbours. Given
a pair of neighbours $e = (a,b)$, we let $\mathcal{N}_e = \mathcal{N}_{ab} = \{c \in [k] : C_a \cap C_b \subseteq C_c\}$
be the set of actions that are incident to $e$. The neighbourhood relation defines
an undirected graph over $[k]$ with edges $E = \{(a,b) : a \text{ and } b \text{ are neighbours}\}$,
which is called the neighbourhood graph.

The next result, which shows the connectedness of the neighbourhood graph
induced by a set of actions whose cells cover the whole simplex, will play an
important role in subsequent proofs:


LEMMA 37.7. Suppose that $S$ is any set of Pareto optimal actions such that
$\cup_{a \in S} C_a = \mathcal{P}_{d-1}$. Then the graph with vertices $S$ and edges from $E$ is connected.


Let $e = (a,b) \in E$. The next lemma characterises actions in $\mathcal{N}_e$ as either $a$, $b$,
duplicates of $a$, $b$ or degenerate actions $c$ for which $\ell_c$ is a convex combination of
$\ell_a$ and $\ell_b$. The situation is illustrated when $d = 2$ in Fig. 37.2.


LEMMA 37.8. Let $e = (a,b) \in E$ be neighbouring actions and $c \in \mathcal{N}_e$ be an action
such that $\ell_c \notin \{\ell_a, \ell_b\}$. Then

(a) there exists an $\alpha \in (0,1)$ such that $\ell_c = \alpha \ell_a + (1 - \alpha)\ell_b$;
(b) $C_c = C_a \cap C_b$; and
(c) $c$ has dimension $d - 2$.


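Proof We use the fact that if $\mathcal{X} \subseteq \mathcal{Y} \subset \mathbb{R}^d$ and $\dim(\mathcal{X}) = \dim(\mathcal{Y})$, then $\operatorname{aff}(\mathcal{X}) =
\operatorname{aff}(\mathcal{Y})$ (Exercise 37.2). Introduce $\ker'(x) = \{u \in \mathbb{R}^d : u^\top x = 0, u^\top \mathbf{1} = 1\}$. Clearly,
$C_a \cap C_b \subseteq C_a \cap C_c$ and $\operatorname{aff}(C_a \cap C_b) = \ker'(\ell_a - \ell_b)$ and $\operatorname{aff}(C_a \cap C_c) = \ker'(\ell_a - \ell_c)$.
By assumption $\dim(C_a \cap C_b) = d - 2$. Since $C_a \cap C_b \subseteq C_a \cap C_c$, it holds that
$\dim(C_a \cap C_c) \ge d - 2$. Furthermore, $\dim(C_a \cap C_c) \le d - 2$, since otherwise $\ell_c = \ell_a$.
Hence $\dim(C_a \cap C_c) = d - 2$ and thus by the fact mentioned and our earlier
findings, $\ker'(\ell_a - \ell_b) = \ker'(\ell_a - \ell_c)$. This implies (Exercise 37.3) that $\ell_a - \ell_b$
is proportional to $\ell_a - \ell_c$ so that $(1 - \alpha)(\ell_a - \ell_b) = \ell_a - \ell_c$ for some $\alpha \ne 1$.
Rearranging shows that

\[
\ell_c = \alpha \ell_a + (1 - \alpha) \ell_b.
\]

Now we show that $\alpha \in (0,1)$. First note that $\alpha \notin \{0,1\}$ since otherwise
$\ell_c \in \{\ell_a, \ell_b\}$. Let $u \in C_a$ be such that $\langle \ell_a, u \rangle < \langle \ell_b, u \rangle$, which exists since
$\dim(C_a) = d - 1$ and $\dim(C_a \cap C_b) = d - 2$. Then

\[
\langle \ell_a, u \rangle \le \langle \ell_c, u \rangle = \alpha \langle \ell_a, u \rangle + (1 - \alpha) \langle \ell_b, u \rangle
= \langle \ell_a, u \rangle + (\alpha - 1) \langle \ell_a - \ell_b, u \rangle,
\]

which by the negativity of $\langle \ell_a - \ell_b, u \rangle$ implies that $\alpha \le 1$. A symmetric argument
shows that $\alpha \ge 0$. For (b), it suffices to show that $C_c \subseteq C_a \cap C_b$. By de Morgan's
law, for this it suffices to show that $\mathcal{P}_{d-1} \setminus (C_a \cap C_b) \subseteq \mathcal{P}_{d-1} \setminus C_c$. Thus, pick
some $u \in \mathcal{P}_{d-1} \setminus (C_a \cap C_b)$. The goal is to show that $u \notin C_c$. The choice of $u$
implies that there exists an action $e$ such that $\langle \ell_a - \ell_e, u \rangle \ge 0$ and $\langle \ell_b - \ell_e, u \rangle \ge 0$
with a strict inequality for either $a$ or $b$ (or both). Therefore, using the fact that
$\alpha \in (0,1)$, we have

\[
\langle \ell_c, u \rangle = \alpha \langle \ell_a, u \rangle + (1 - \alpha) \langle \ell_b, u \rangle > \langle \ell_e, u \rangle,
\]

which by definition means that $u \notin C_c$, completing the proof of (b). Finally, (c)
is immediate from (b) and the definition of neighbouring actions.


[Figure 37.2: plot of the losses $\langle \ell_a, (u, 1-u) \rangle$ of the three actions against $u \in [0,1]$.]

Figure 37.2 The figure shows the situation when $d = 2$ and $\ell_1 = (1,0)$ and $\ell_2 = (0,1)$
and $\ell_3 = (1/2,1/2)$. The x axis corresponds to $\mathcal{P}_1 = [0,1]$, the y axis to the losses.
Then $C_1 = [0,1/2]$ and $C_2 = [1/2,1]$, which both have dimension $1 = d - 1$. Then
$C_3 = \{1/2\} = C_1 \cap C_2$, which has dimension 0.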


Estimating Loss Differences 


In order to achieve small regret, the learner needs to identify an optimal action. 
How efficiently this can be done depends on the loss and feedback matrices. An 
initial observation is that since the loss matrix is known, the learner can restrict 
the search for the optimal action to the Pareto optimal actions. Furthermore, 
by Lemma 37.7, it suffices to estimate the loss differences between neighbours 
and then chain the estimates together along a connecting path. The second 
important point is that to minimise the regret the learner only needs to estimate 
the differences in losses between Pareto optimal actions and not the actual losses
themselves. In fact, there exist games for which estimating the actual losses is 
impossible, but estimating the differences is straightforward: 


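EXAMPLE 37.9. Consider the partial monitoring game with

\[
\mathcal{L} = \begin{pmatrix} 0 & 1 & 10 & 11 \\ 1 & 0 & 11 & 10 \end{pmatrix}, \qquad
\Phi = \begin{pmatrix} \clubsuit & \heartsuit & \clubsuit & \heartsuit \\ \clubsuit & \heartsuit & \clubsuit & \heartsuit \end{pmatrix}.
\]

The learner can never tell if the environment is playing in the first two columns or
the last two, but the differences between the losses of actions are easily deduced
from the feedback no matter the outcome and the action.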


Only the loss differences between Pareto optimal actions need to be estimated. 
There are games that are easy, but where some loss differences cannot be 
estimated. For example, there is never any need to estimate the losses of a 
dominated action. 


Having decided we need to estimate the loss differences between neighbouring
Pareto optimal actions, the next question is how the learner can do this. Focusing
our attention on a single round, suppose the adversary secretly chooses an outcome
$i \in [d]$ and the learner samples an action $A$ from distribution $p \in \operatorname{ri}(\mathcal{P}_{k-1})$ and
observes $\sigma = \Phi_{Ai}$. We are interested in finding an unbiased estimator of $\mathcal{L}_{ai} - \mathcal{L}_{bi}$
for neighbouring actions $a$ and $b$. Without loss of generality, the estimator can
be in the form of $f(A,\sigma)/p_A$ with some function $f : [k] \times \Sigma \to \mathbb{R}$. Then, the
unbiasedness requirement takes the convenient form

\[
\mathbb{E}\left[\frac{f(A,\sigma)}{p_A}\right] = \sum_{c=1}^k p_c \, \frac{f(c, \Phi_{ci})}{p_c} = \sum_{c=1}^k f(c, \Phi_{ci}) = \mathcal{L}_{ai} - \mathcal{L}_{bi}.
\]

In other words, $f(A,\sigma)/p_A$ is an unbiased estimator of $\mathcal{L}_{ai} - \mathcal{L}_{bi}$ regardless of
the adversary's choice if and only if

\[
\sum_{c=1}^k f(c, \Phi_{ci}) = \mathcal{L}_{ai} - \mathcal{L}_{bi} \quad \text{for all } i \in [d]. \tag{37.3}
\]

A pair of neighbours $a$ and $b$ are called globally observable if there exists a
function $f$ satisfying Eq. (37.3). The set of all functions $f : [k] \times \Sigma \to \mathbb{R}$ satisfying
Eq. (37.3) is denoted by $\mathcal{E}^{\mathrm{glo}}_{ab}$. A pair of neighbours $a$ and $b$ are locally observable
if $f$ can be chosen satisfying Eq. (37.3) with $f(c, \sigma) = 0$ whenever $c \notin \mathcal{N}_{ab}$. The set
of functions satisfying this additional requirement is $\mathcal{E}^{\mathrm{loc}}_{ab}$. A partial monitoring
problem is called globally/locally observable if all pairs of neighbouring actions are
globally/locally observable. The global/local observability conditions formalise
the idea introduced in Example 37.3. Games that are globally observable but not
locally observable are hard because the learner cannot identify the optimal action
by playing near-optimal actions only. Instead it has to play badly suboptimal
actions to gain information, and this increases the minimax regret.


EXAMPLE 37.10. The partial monitoring problem illustrated in Fig. 37.3 has six
actions, three feedbacks and three outcomes. The cell decomposition is shown on
the right with the 2-simplex parameterised by its first two coordinates $u_1$ and
$u_2$ so that $u_3 = 1 - u_1 - u_2$. Actions 1, 2 and 3 are Pareto optimal. There are
no dominated actions while actions 4 and 5 are 1-dimensional and action 6 is
0-dimensional. The neighbours are (1,3) and (2,3), which are both locally observable,
and so the game is locally observable. Note that (1,2) are not neighbours because
the intersection of their cells is $(d - 3)$-dimensional. Finally, $\mathcal{N}_3 = \{1,2,3\}$ and
$\mathcal{N}_1 = \{1,3\}$ and $\mathcal{N}_{23} = \{2,3,4\}$. Think about how we decided on what losses to
use to get the cell decomposition shown in Fig. 37.3.

\[
\mathcal{L} = \begin{pmatrix}
0 & 1 & 1 \\
1 & 0 & 1 \\
1/2 & 1/2 & 1/2 \\
3/4 & 1/4 & 3/4 \\
1 & 1/2 & 1/2 \\
1 & 1/4 & 3/4
\end{pmatrix}
\]

[Figure 37.3: loss matrix $\mathcal{L}$ (left), feedback matrix $\Phi$ (middle) and cell decomposition with cells $C_1, \dots, C_6$ in the $(u_1, u_2)$-plane (right).]

Figure 37.3 Partial monitoring game with $k = 6$ and $d = 3$ and $m = 3$.


37.3 Classification of Finite Adversarial Partial Monitoring 


The terminology in the last section finally allows us to state the main theorem of 
this chapter that classifies finite adversarial partial monitoring games. 


THEOREM 37.11. The minimax regret of partial monitoring problem $G = (\mathcal{L}, \Phi)$
falls into one of four categories:

\[
R_n^*(G) = \begin{cases}
0, & \text{if } G \text{ has no pairs of neighbouring actions;} \\
\Theta(n^{1/2}), & \text{if } G \text{ is locally observable and has neighbouring actions;} \\
\Theta(n^{2/3}), & \text{if } G \text{ is globally observable, but not locally observable;} \\
\Omega(n), & \text{otherwise.}
\end{cases}
\]


The Landau notation is used in the traditional mathematical sense and
obscures dependence on $k$, $d$, $m$ and the finer structure of $G = (\mathcal{L}, \Phi)$.


The proof is split into parts by proving upper and lower bounds for each part.
First up are the lower bounds. We then describe a policy and analyse its regret.


37.4 Lower Bounds


Like for bandits, the lower bounds are most easily proven using a stochastic
adversary. In stochastic partial monitoring, we assume that $u_1, \dots, u_n$ are
chosen independently at random from the same distribution. To emphasise
the randomness, we switch to capital letters. Given a partial monitoring game
$G = (\mathcal{L}, \Phi)$ and probability vector $u \in \mathcal{P}_{d-1}$, the stochastic partial monitoring
environment associated with $u$ samples a sequence of independently and identically
distributed random variables $I_1, \dots, I_n$ with $\mathbb{P}(I_t = i) = u_i$ and $U_t = e_{I_t}$. In each
round $t$, a policy chooses action $A_t$ and receives feedback $\sigma_t = \Phi_{A_t I_t}$. The regret
is

\[
R_n(\pi, u) = \max_{a \in [k]} \mathbb{E}\left[\sum_{t=1}^n \langle \ell_{A_t} - \ell_a, U_t \rangle\right]
= \max_{a \in [k]} \mathbb{E}\left[\sum_{t=1}^n \langle \ell_{A_t} - \ell_a, u \rangle\right].
\]

The reader should check that $R_n^*(G) \ge \inf_\pi \max_{u \in \mathcal{P}_{d-1}} R_n(\pi, u)$, which allows
us to restrict our attention to stochastic partial monitoring problems. Given
$u, q \in \mathcal{P}_{d-1}$, let $D(u, q)$ be the relative entropy between categorical distributions
with parameters $u$ and $q$ respectively:

\[
D(u, q) = \sum_{i=1}^d u_i \log\left(\frac{u_i}{q_i}\right) \le \sum_{i=1}^d \frac{(u_i - q_i)^2}{q_i}, \tag{37.4}
\]

where the inequality follows from the fact that for measures $P, Q$ we have
$D(P, Q) \le \chi^2(P, Q)$ (see Note 6 in Chapter 13).
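The inequality in Eq. (37.4) is easy to check numerically. The sketch below is only a sanity check on one pair of categorical distributions; the function names `relative_entropy` and `chi_squared` and the example vectors are choices made here for illustration.

```python
import math


def relative_entropy(u, q):
    """D(u, q) = sum_i u_i log(u_i / q_i) for categorical distributions."""
    return sum(ui * math.log(ui / qi) for ui, qi in zip(u, q) if ui > 0)


def chi_squared(u, q):
    """chi^2(u, q) = sum_i (u_i - q_i)^2 / q_i."""
    return sum((ui - qi) ** 2 / qi for ui, qi in zip(u, q))


u = (0.2, 0.3, 0.5)
q = (0.25, 0.25, 0.5)
print(relative_entropy(u, q), chi_squared(u, q))
```

For nearby distributions both quantities scale quadratically in the perturbation, which is exactly what the lower-bound proofs exploit.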


THEOREM 37.12. Let $G = (\mathcal{L}, \Phi)$ be a globally observable partial monitoring
problem that is not locally observable. Then there exists a constant $c_G > 0$ such
that $R_n^*(G) \ge c_G n^{2/3}$.


Proof The proof involves several steps. Roughly, we need to define two alternative
stochastic partial monitoring problems. We then show these environments are
hard to distinguish without playing an action associated with a large loss. Finally
we balance the cost of distinguishing the environments against the linear cost of
playing randomly. Without loss of generality assume that $\Sigma = [m]$.


Step 1: Defining the Alternatives
Let $a, b$ be a pair of neighbouring actions that are not locally observable. Then, by
definition, $C_a \cap C_b$ is a polytope of dimension $d - 2$. Let $u$ be the centroid of
$C_a \cap C_b$ and

\[
\varepsilon = \min_{c \notin \mathcal{N}_{ab}} \langle \ell_c - \ell_a, u \rangle. \tag{37.5}
\]

Figure 37.4 Lower-bound construction for hard partial monitoring problems. Shown is
$\mathcal{P}_{d-1}$, the cells $C_a$ and $C_b$ of two Pareto optimal actions, and two alternatives $u_a \in C_a$
and $u_b \in C_b$ that induce the same distributions on the outcomes under both $a$ and $b$.


The value of $\varepsilon$ is well defined, since by global observability of $G$, but nonlocal
observability of $(a,b)$, there must exist some action $c \notin \mathcal{N}_{ab}$. Furthermore, since
$c \notin \mathcal{N}_{ab}$, it follows that $\varepsilon > 0$. As in the lower-bound constructions for stochastic
bandits, we now define two stochastic partial monitoring problems $u_a, u_b$ by
choosing a direction $q \in \mathbb{R}^d$ and a small value $\Delta$ such that $u_a = u - \Delta q \in C_a$
and $u_b = u + \Delta q \in C_b$ (see Fig. 37.4). This means that action $a$ is optimal if the
environment plays $u_a$ on average and $b$ is optimal if the environment plays $u_b$
on average. The direction $q$ will be chosen so that using $a$ and $b$ alone it is not
possible to distinguish between $u_a$ and $u_b$.

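The vector $q$ is chosen as follows: Since $(a, b)$ are not locally observable, $\mathcal{E}^{\mathrm{loc}}_{ab} = \emptyset$.
Equivalently, there does not exist a function $f : [k] \times \Sigma \to \mathbb{R}$ such that for all
$i \in [d]$,

\[
\sum_{c \in \mathcal{N}_{ab}} f(c, \Phi_{ci}) = \ell_{ai} - \ell_{bi}. \tag{37.6}
\]

In this form, it does not seem obvious what the next step should be. To clear
things up, we introduce some linear algebra. Let $S_c \in \{0,1\}^{m \times d}$ be the matrix
with $(S_c)_{\sigma i} = \mathbb{I}\{\Phi_{ci} = \sigma\}$, which is chosen so that $S_c e_i = e_{\Phi_{ci}}$. Define the linear
map $S : \mathbb{R}^d \to \mathbb{R}^{|\mathcal{N}_{ab}| m}$ by

\[
S = \begin{pmatrix} S_a \\ S_b \\ \vdots \\ S_c \end{pmatrix},
\]

which is the matrix formed by stacking the matrices $\{S_c : c \in \mathcal{N}_{ab}\}$. Then, an
elementary argument shows that there exists a function $f$ satisfying Eq. (37.6) if
and only if there exists a $w \in \mathbb{R}^{|\mathcal{N}_{ab}| m}$ such that

\[
\ell_a - \ell_b = S^\top w.
\]

In other words, actions $(a, b)$ are locally observable if and only if $\ell_a - \ell_b \in \operatorname{im}(S^\top)$.
Since we have assumed that $(a,b)$ are not locally observable, we must have
$\ell_a - \ell_b \notin \operatorname{im}(S^\top)$. Let $z \in \operatorname{im}(S^\top)$ and $w \in \ker(S)$ be such that $\ell_a - \ell_b = z + w$,
which is possible since $\operatorname{im}(S^\top) \oplus \ker(S) = \mathbb{R}^d$. Since $\ell_a - \ell_b \notin \operatorname{im}(S^\top)$, it holds that
$w \ne 0$ and $\langle \ell_a - \ell_b, w \rangle = \langle z + w, w \rangle = \langle w, w \rangle \ne 0$. Note also that $\mathbf{1} \in \operatorname{im}(S^\top)$ and
hence $\langle \mathbf{1}, w \rangle = 0$. Finally, let $q = w / \langle \ell_a - \ell_b, w \rangle$. By construction, $q \in \mathbb{R}^d$, $q \ne 0$
while $Sq = 0$, $\langle \ell_a - \ell_b, q \rangle = 1$ and $\langle \mathbf{1}, q \rangle = 0$. Let $\Delta > 0$ be some small constant
to be chosen subsequently. With this, we define $u_a = u - \Delta q$ and $u_b = u + \Delta q$ so
that

\[
\langle \ell_b - \ell_a, u_a \rangle = \Delta \quad \text{and} \quad \langle \ell_a - \ell_b, u_b \rangle = \Delta. \tag{37.7}
\]

We note that if $\Delta$ is sufficiently small, then $u_a \in C_a$ and $u_b \in C_b$ because $a$ and
$b$ are Pareto optimal.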
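The criterion $\ell_a - \ell_b \in \operatorname{im}(S^\top)$ is a plain linear-algebra test, and can be sketched with exact arithmetic as below. The helper names (`rank`, `signal_matrix`, `observable`) are hypothetical, actions are indexed from 0, and the example uses matching pennies with $c = 1/4$ under the assumption that the first two actions return an uninformative symbol (written `"x"` here) while the third reveals the outcome. Membership in $\operatorname{im}(S^\top)$, the row space of $S$, is checked by comparing ranks.

```python
from fractions import Fraction as F


def rank(rows):
    """Rank of a matrix given as a list of rows, by exact Gaussian elimination."""
    m = [[F(x) for x in r] for r in rows]
    rk = 0
    for col in range(len(m[0]) if m else 0):
        piv = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk


def signal_matrix(phi_row, symbols):
    """S_c with entry (sigma, i) equal to 1 if Phi[c][i] == sigma, else 0."""
    return [[1 if s == sym else 0 for s in phi_row] for sym in symbols]


def observable(L, Phi, a, b, actions):
    """True iff l_a - l_b is in the row space of the stacked S_c, c in actions."""
    symbols = sorted({s for row in Phi for s in row})
    S = [r for c in actions for r in signal_matrix(Phi[c], symbols)]
    diff = [x - y for x, y in zip(L[a], L[b])]
    return rank(S) == rank(S + [diff])


# Matching pennies with c = 1/4 (feedback symbols are assumptions).
L = [[0, 1], [1, 0], [F(1, 4), F(1, 4)]]
Phi = [["x", "x"], ["x", "x"], ["H", "T"]]

print(observable(L, Phi, 0, 2, actions=[0, 2]))     # using only actions 1 and 3
print(observable(L, Phi, 0, 1, actions=[0, 1]))     # using only actions 1 and 2
print(observable(L, Phi, 0, 1, actions=[0, 1, 2]))  # using all actions
```

Passing `actions=range(k)` gives the corresponding test for global observability of a pair, since then every $S_c$ is stacked.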


Step 2: Calculating the Relative Entropy

Given action $c$ and $r \in \mathcal{P}_{d-1}$, let $P_{cr}$ be the distribution on the feedback observed
by the learner when playing action $c$ in the stochastic partial monitoring environment
determined by $r$. That is, $P_{cr}(\sigma) = (S_c r)_\sigma$. Further, let $\mathbb{P}_r$ be the distribution
on the histories $H_n = (A_1, \Phi_1, \dots, A_n, \Phi_n)$ arising from the interaction of the
learner's policy with the stochastic environment determined by $r$. Expectations
with respect to $\mathbb{P}_r$ are denoted by $\mathbb{E}_r$. A modification of Lemma 15.1 shows that

\[
D(\mathbb{P}_{u_a}, \mathbb{P}_{u_b}) = \sum_{c \in [k]} \mathbb{E}_{u_a}[T_c(n)] \, D(P_{c u_a}, P_{c u_b}). \tag{37.8}
\]

By the definitions of $u_a$ and $u_b$, we have $S_c u_a = S_c u_b$ for all $c \in \mathcal{N}_{ab}$. Therefore,
$P_{c u_a} = P_{c u_b}$ and $D(P_{c u_a}, P_{c u_b}) = 0$ for all $c \in \mathcal{N}_{ab}$. On the other hand, if
$c \notin \mathcal{N}_{ab}$, then by the data processing inequality (Exercise 14.10) and Eq. (37.4),
for $\Delta \le \min_{i : q_i \ne 0} u_i / (2|q_i|)$,

\[
D(P_{c u_a}, P_{c u_b}) \le D(u_a, u_b) \le \sum_{i=1}^d \frac{(u_{ai} - u_{bi})^2}{u_{bi}}
\le 4 \Delta^2 \sum_{i=1}^d \frac{q_i^2}{u_i - \Delta |q_i|} \le C_u \Delta^2,
\]

where we used that $u \in C_a \cap C_b$ is not on the boundary of $\mathcal{P}_{d-1}$, so $u_i > 0$ for all
$i$ and we defined $C_u$ as a suitably large constant that depends on $u$ ($q$ is entirely
determined by $a$ and $b$). Therefore,

\[
D(\mathbb{P}_{u_a}, \mathbb{P}_{u_b}) \le C_u \sum_{c \notin \mathcal{N}_{ab}} \mathbb{E}_{u_a}[T_c(n)] \, \Delta^2. \tag{37.9}
\]

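Step 3: Comparing the Regret
By Eq. (37.5) and Hölder's inequality, for $c \notin \mathcal{N}_{ab}$ we have $\langle \ell_c - \ell_a, u_a \rangle \ge
\varepsilon - \langle \ell_c - \ell_a, \Delta q \rangle \ge \varepsilon - \Delta \|q\|_1$ and $\langle \ell_c - \ell_b, u_b \rangle \ge \varepsilon - \Delta \|q\|_1$, where, for simplicity
and without loss of generality, we assumed that the losses lie in $[0,1]$. Define
$\tilde T(n)$ to be the number of times an arm not in $\mathcal{N}_{ab}$ is played:

\[
\tilde T(n) = \sum_{c \notin \mathcal{N}_{ab}} T_c(n).
\]

By Lemma 37.8, for each action $c \in \mathcal{N}_{ab}$, there exists an $\alpha \in [0,1]$ such that
$\ell_c = \alpha \ell_a + (1 - \alpha) \ell_b$. Therefore, by Eq. (37.7),

\[
\langle \ell_c - \ell_a, u_a \rangle + \langle \ell_c - \ell_b, u_b \rangle
= (1 - \alpha) \langle \ell_b - \ell_a, u_a \rangle + \alpha \langle \ell_a - \ell_b, u_b \rangle = \Delta, \tag{37.10}
\]

which means that $\max(\langle \ell_c - \ell_a, u_a \rangle, \langle \ell_c - \ell_b, u_b \rangle) \ge \Delta/2$. Define $\hat T(n)$ as the
number of times an arm in $\mathcal{N}_{ab}$ is played that is at least $\Delta/2$ suboptimal in $u_a$:

\[
\hat T(n) = \sum_{c \in \mathcal{N}_{ab}} \mathbb{I}\left\{ \langle \ell_c - \ell_a, u_a \rangle \ge \frac{\Delta}{2} \right\} T_c(n).
\]

It also follows from (37.10) that if $c \in \mathcal{N}_{ab}$ and $\langle \ell_c - \ell_a, u_a \rangle < \Delta/2$, then
$\langle \ell_c - \ell_b, u_b \rangle \ge \Delta/2$. Hence, under $u_b$, the random pseudo-regret, $\sum_c T_c(n) \langle \ell_c - \ell_b, u_b \rangle$,
is at least $(n - \hat T(n)) \Delta / 2$. Assume that $\Delta$ is chosen sufficiently small so that
$\Delta \|q\|_1 \le \varepsilon / 2$. By the above,

\[
\begin{aligned}
R_n(\pi, u_a) + R_n(\pi, u_b)
&= \mathbb{E}_{u_a}\left[\sum_{c \in [k]} T_c(n) \langle \ell_c - \ell_a, u_a \rangle\right]
 + \mathbb{E}_{u_b}\left[\sum_{c \in [k]} T_c(n) \langle \ell_c - \ell_b, u_b \rangle\right] \\
&\ge \frac{\varepsilon}{2} \mathbb{E}_{u_a}[\tilde T(n)] + \frac{n \Delta}{4} \left( \mathbb{P}_{u_a}(\hat T(n) \ge n/2) + \mathbb{P}_{u_b}(\hat T(n) < n/2) \right) \\
&\ge \frac{\varepsilon}{2} \mathbb{E}_{u_a}[\tilde T(n)] + \frac{n \Delta}{8} \exp\left( -D(\mathbb{P}_{u_a}, \mathbb{P}_{u_b}) \right) \\
&\ge \frac{\varepsilon}{2} \mathbb{E}_{u_a}[\tilde T(n)] + \frac{n \Delta}{8} \exp\left( -C_u \Delta^2 \mathbb{E}_{u_a}[\tilde T(n)] \right),
\end{aligned}
\]

where the second inequality follows from the Bretagnolle-Huber inequality
(Theorem 14.2) and the third from Eqs. (37.8) and (37.9). The bound is completed
by choosing

\[
\Delta = \min\left( \min_{i : q_i \ne 0} \frac{u_i}{2|q_i|}, \; \frac{\varepsilon}{2 \|q\|_1 n^{1/3}} \right),
\]

which is finite since $q \ne 0$. Straightforward calculation concludes the result
(Exercise 37.7).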


We leave the following theorems as exercises for the reader (Exercises 37.8 
and 37.9): 


THEOREM 37.13. If $G$ is not globally observable and has at least two
non-dominated actions, then there exists a constant $c_G > 0$ such that $R_n^*(G) \ge c_G n$.


Proof sketch Since $G$ is not globally observable, there exists a pair of
neighbouring actions $(a, b)$ that are not globally observable. Let $u$ be the centroid
of $C_a \cap C_b$. Let $S \in \mathbb{R}^{km \times d}$ be the stack of matrices from $\{S_c : c \in [k]\}$. Then, using
the same argument as the previous proof, we have $\ell_a - \ell_b \notin \operatorname{im}(S^\top)$. Now define
$q \in \mathbb{R}^d$ such that $\langle \mathbf{1}, q \rangle = 0$, $\langle \ell_a - \ell_b, q \rangle = 1$ and $Sq = 0$. Let $\Delta > 0$ be sufficiently
small and $u_a = u - \Delta q$ and $u_b = u + \Delta q$. Show that $D(\mathbb{P}_{u_a}, \mathbb{P}_{u_b}) = 0$ for all policies
and complete the proof in the same fashion as the proof of Theorem 37.12.


THEOREM 37.14. Let $G = (\mathcal{L}, \Phi)$ be locally observable and have at least one pair
of neighbours. Then there exists a constant $c_G > 0$ such that for all large enough
$n$ the minimax regret satisfies $R_n^*(G) \ge c_G \sqrt{n}$.


Proof sketch By assumption, there exists a pair of neighbouring actions $(a, b)$.
Define $u$ as the centroid of $C_a \cap C_b$ and let $u_a$ and $u_b$ be the centroids of $C_a$ and
$C_b$ respectively. For sufficiently small $\Delta > 0$, let $v_a = (1 - \Delta)u + \Delta u_a$ and
$v_b = (1 - \Delta)u + \Delta u_b$. Then

\[
D(\mathbb{P}_{v_a}, \mathbb{P}_{v_b}) \le n \sum_{i=1}^d \frac{(v_{ai} - v_{bi})^2}{v_{bi}} \le c_G n \Delta^2,
\]

where $c_G > 0$ is a game-dependent constant. Let $\Delta = 1/\sqrt{n}$ and apply the ideas
in the proof of Theorem 37.12.


37.5 Policy and Upper Bounds


We now describe a policy for globally and locally observable games, and prove
its regret is $O(n^{1/2})$ for locally observable games and $O(n^{2/3})$ otherwise. For
the remainder of this section, fix a globally observable game $G = (\mathcal{L}, \Phi)$. The
estimation functions in $\mathcal{E}^{\mathrm{glo}}_{ab}$ and $\mathcal{E}^{\mathrm{loc}}_{ab}$ are designed to combine with
importance-weighting to estimate the loss differences between actions $a$ and $b$. For this section,
it is more convenient to define estimation functions for the whole loss vector up
to constant shifts. Let $\mathcal{E}^{\mathrm{vec}}$ be the set of all functions $f : [k] \times \Sigma \to \mathbb{R}^k$ such that:

(a) $f(a, \sigma)_b = 0$ for all $b \notin \Pi$; and
(b) for each outcome $i \in [d]$, there exists a constant $c \in \mathbb{R}$ with
\[
\sum_{a=1}^k f(a, \Phi_{ai})_b = \mathcal{L}_{bi} + c \quad \text{for all } b \in \Pi.
\]

The intuition is that $\mathcal{E}^{\mathrm{vec}}$ is the set of functions that serve as unbiased loss
difference estimators in the sense that when $A \sim p \in \operatorname{ri}(\mathcal{P}_{k-1})$, then

\[
\mathbb{E}\left[ \frac{\langle e_a - e_b, f(A, \Phi_{Ai}) \rangle}{p_A} \right] = \langle \ell_a - \ell_b, e_i \rangle
\quad \text{for all Pareto optimal } a, b \text{ and } i \in [d].
\]

As we will see in the proof of Theorem 37.16, if $G$ is globally observable, then
$\mathcal{E}^{\mathrm{vec}}$ is non-empty. By identifying functions and vectors, $\mathcal{E}^{\mathrm{vec}} \subset \mathbb{R}^{k \times m \times k}$, a view
that will be useful later.

The policy for partial monitoring combines exponential weights with a careful
exploration strategy. A little reminder about exponential weights and some new
notation will be useful. Given a probability vector $q \in \mathcal{P}_{k-1}$, define a function
$\Psi_q : \mathbb{R}^k \to \mathbb{R}$ by

\[
\Psi_q(z) = \langle q, \exp(-z) + z - 1 \rangle,
\]

where the exponential function is applied component-wise. You might recognise
$\Psi_q$ as the Bregman divergence

\[
\Psi_q(z) = D_{F^*}(\nabla F(q) - z, \nabla F(q)),
\]

where $F$ is the unnormalised negentropy potential. Suppose that $(\hat y_t)_{t=1}^n$ is an
arbitrary sequence of vectors with $\hat y_t \in \mathbb{R}^k$, $\eta > 0$ and

\[
Q_{ta} = \frac{\exp\left( -\eta \sum_{s=1}^{t-1} \hat y_{sa} \right)}{\sum_{b=1}^k \exp\left( -\eta \sum_{s=1}^{t-1} \hat y_{sb} \right)}, \qquad t \in [n].
\]

Recall from Theorem 28.4 that for any $a^* \in [k]$,

\[
\sum_{t=1}^n \sum_{a=1}^k Q_{ta} (\hat y_{ta} - \hat y_{ta^*}) \le \frac{\log(k)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \Psi_{Q_t}(\eta \hat y_t). \tag{37.11}
\]

Exp3 is derived by defining $\hat y_t$ as the importance-weighted loss estimator and
sampling $A_t$ from $Q_t$. We will do something similar in partial monitoring, but
with two significant differences: (a) the importance-weighted estimator must
depend on the feedback and loss matrices, and (b) the algorithm will sample
$A_t$ from an alternative distribution $P_t$ that is optimised to balance the regret
suffered relative to $Q_t$ and the information gained.
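The two ingredients recalled above, the exponential weights distribution and the function $\Psi_q$, are short to write down. The following is a minimal sketch with hypothetical names (`exp_weights`, `psi`); the shift by the minimum cumulative estimate is a numerical-stability choice made here and does not change the distribution.

```python
import math


def exp_weights(cum_estimates, eta):
    """Q_a proportional to exp(-eta * cumulative loss estimate of action a)."""
    shift = min(cum_estimates)  # a constant shift cancels in the normalisation
    w = [math.exp(-eta * (y - shift)) for y in cum_estimates]
    total = sum(w)
    return [x / total for x in w]


def psi(q, z):
    """Psi_q(z) = <q, exp(-z) + z - 1>, applied component-wise."""
    return sum(qa * (math.exp(-za) + za - 1.0) for qa, za in zip(q, z))


Q = exp_weights([0.0, 1.0, 2.0], eta=0.5)
z = [0.3, -0.2, 0.1]
print(Q, psi(Q, z))
```

Since $\exp(-x) + x - 1 \ge 0$ for all real $x$, the value of `psi` is never negative, and it vanishes at $z = 0$.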


The definition of global observability does not imply that loss differences
between dominated and degenerate actions can be estimated. Consequently,
the distribution $Q_t$ used by the new algorithm will be supported on Pareto
optimal actions only. The actual distribution $P_t$ used when choosing an
action may also include degenerate actions, however.


The optimisation problem for balancing information and regret explicitly
optimises a worst case upper bound on the right-hand side of Eq. (37.11). For
$\eta > 0$ and $q \in \mathcal{P}_{k-1}$ with $\operatorname{Supp}(q) \subseteq \Pi$, let

\[
\operatorname{opt}_q(\eta) = \inf_{\substack{f \in \mathcal{E}^{\mathrm{vec}} \\ p \in \operatorname{ri}(\mathcal{P}_{k-1})}} \max_{i \in [d]}
\left[ \frac{1}{\eta} (p - q)^\top \mathcal{L} e_i + \frac{1}{\eta^2} \sum_{a=1}^k p_a \Psi_q\left( \frac{\eta f(a, \Phi_{ai})}{p_a} \right) \right]. \tag{37.12}
\]

Of course, $\operatorname{opt}_q(\eta)$ depends on the game $G$, which is hidden from the notation
to reduce clutter. The first term in the right-hand side of Eq. (37.12) measures
the additional regret when playing $p$ rather than $q$, while the second corresponds
to the expectation of the second term in Eq. (37.11) when the algorithm uses
importance-weighting using estimation function $f$. The optimisation problem is
convex and hence amenable to efficient computation (see Note 9 for some details).
The worst-case value over all $q$ is

\[
\operatorname{opt}^*(\eta) = \sup\{ \operatorname{opt}_q(\eta) : q \in \mathcal{P}_{k-1}, \operatorname{Supp}(q) \subseteq \Pi \}.
\]

The function $q \mapsto \operatorname{opt}_q(\eta)$ is generally not convex, so $\operatorname{opt}^*(\eta)$ may be hard to
compute. This causes a minor problem when setting the learning rate, which can
be mitigated by adapting the learning rate online as discussed in Note 7.


We say that $f \in \mathcal{E}^{\mathrm{vec}}$ and $p \in \operatorname{ri}(\mathcal{P}_{k-1})$ solve Eq. (37.12) with precision
$\varepsilon \ge 0$ if

\[
\max_{i \in [d]} \left[ \frac{1}{\eta} (p - q)^\top \mathcal{L} e_i + \frac{1}{\eta^2} \sum_{a=1}^k p_a \Psi_q\left( \frac{\eta f(a, \Phi_{ai})}{p_a} \right) \right] \le \operatorname{opt}_q(\eta) + \varepsilon. \tag{37.13}
\]

Such approximately optimal solutions exist for any $\varepsilon > 0$, but may not exist
for $\varepsilon = 0$ because the constraint on $p$ is not compact.


The convexity of the inner maximum in Eq. (37.12) can be checked using the
following construction. The perspective of a convex function $f : \mathbb{R}^d \to \mathbb{R}$
is a function $g : \mathbb{R}^{d+1} \to \mathbb{R}$ given by

\[
g(x, u) = \begin{cases} u f(x/u), & \text{if } u > 0; \\ \infty, & \text{otherwise.} \end{cases} \tag{37.14}
\]

The perspective is known to be convex (Exercise 37.1). Since $\Psi_q$ is convex
and the max of convex functions is convex, it follows that the term inside of
the infimum of Eq. (37.12) is convex.


The full algorithm is given as Algorithm 26. 


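THEOREM 37.15. For any $\eta > 0$ and $\varepsilon > 0$, the regret of Algorithm 26 is bounded
by

\[
R_n \le \frac{\log(k)}{\eta} + n \eta (\operatorname{opt}^*(\eta) + \varepsilon).
\]


1: Input: $\eta$, $\varepsilon$, $\mathcal{L}$ and $\Phi$
2: for $t \in 1, \dots, n$ do
3:     Compute exponential weights distribution $Q_t \in \mathcal{P}_{k-1}$ by:
           $Q_{ta} = \dfrac{\mathbb{I}_\Pi(a) \exp\left( -\eta \sum_{s=1}^{t-1} \hat y_{sa} \right)}{\sum_{b \in \Pi} \exp\left( -\eta \sum_{s=1}^{t-1} \hat y_{sb} \right)}$
4:     Solve Eq. (37.12) with $q = Q_t$ and precision $\varepsilon$ to find $P_t \in \operatorname{ri}(\mathcal{P}_{k-1})$ and
       $f_t \in \mathcal{E}^{\mathrm{vec}}$
5:     Sample $A_t \sim P_t$ and observe $\sigma_t$
6:     Set $\hat y_t = f_t(A_t, \sigma_t) / P_{t A_t}$
7: end for

Algorithm 26: Exponential weights for partial monitoring. Recall that $\Pi$ denotes the set of
Pareto optimal actions.


Proof The result follows from the definitions of $\mathcal{E}^{\mathrm{vec}}$ and the regret, and the
bound for exponential weights in Eq. (37.11). Let $a^* = \operatorname{argmin}_{a \in \Pi} \sum_{t=1}^n \langle \ell_a, u_t \rangle$
and write $y_{ta} = \langle \ell_a, u_t \rangle$. Then,

\[
\begin{aligned}
R_n &= \mathbb{E}\left[ \sum_{t=1}^n \sum_{a=1}^k P_{ta} (y_{ta} - y_{ta^*}) \right] \\
&= \mathbb{E}\left[ \sum_{t=1}^n \sum_{a=1}^k Q_{ta} (y_{ta} - y_{ta^*}) \right] + \mathbb{E}\left[ \sum_{t=1}^n \sum_{a=1}^k (P_{ta} - Q_{ta}) y_{ta} \right] \\
&= \mathbb{E}\left[ \sum_{t=1}^n \sum_{a=1}^k Q_{ta} (\hat y_{ta} - \hat y_{ta^*}) \right] + \mathbb{E}\left[ \sum_{t=1}^n \sum_{a=1}^k (P_{ta} - Q_{ta}) y_{ta} \right],
\end{aligned}
\]

where the last equality uses that $Q_t$ is supported on $\Pi$ and that $\hat y_t$ gives unbiased
estimates of the loss differences between Pareto optimal actions.
The first expectation is bounded using the definition of $Q_t$ and Eq. (37.11) by

\[
\begin{aligned}
\mathbb{E}\left[ \sum_{t=1}^n \sum_{a=1}^k Q_{ta} (\hat y_{ta} - \hat y_{ta^*}) \right]
&\le \frac{\log(k)}{\eta} + \frac{1}{\eta} \mathbb{E}\left[ \sum_{t=1}^n \Psi_{Q_t}(\eta \hat y_t) \right] \\
&= \frac{\log(k)}{\eta} + \frac{1}{\eta} \mathbb{E}\left[ \sum_{t=1}^n \sum_{a=1}^k P_{ta} \Psi_{Q_t}\left( \frac{\eta f_t(a, \Phi_{a i_t})}{P_{ta}} \right) \right].
\end{aligned}
\]

Combining the two displays, using the definitions of $P_t$, $f_t$ and $\varepsilon$, and substituting
the definition of $\operatorname{opt}_{Q_t}(\eta) \le \operatorname{opt}^*(\eta)$ completes the proof.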


The extent to which this result is useful depends on the behaviour of $\operatorname{opt}^*(\eta)$
for different classes of games. The following two theorems bound the value of
the optimisation problem for globally observable and locally observable games
respectively. An apparently important quantity in the regret upper bounds for
both globally and locally observable games is the minimum magnitude of the
estimation functions. Let

\[
v_{\mathrm{glo}} = \max_{e \in E} \min_{f \in \mathcal{E}^{\mathrm{glo}}_e} \|f\|_\infty \quad \text{and} \quad
v_{\mathrm{loc}} = \max_{e \in E} \min_{f \in \mathcal{E}^{\mathrm{loc}}_e} \|f\|_\infty.
\]

In the remainder of this chapter we assume that the losses are between zero
and one: $\mathcal{L} \in [0, 1]^{k \times d}$.


THEOREM 37.16. For all globally observable games, $\operatorname{opt}^*(\eta) \le 2 v_{\mathrm{glo}} k^2 / \sqrt{\eta}$ for all
$\eta \le 1 / \max\{1, v_{\mathrm{glo}}^2 k^4\}$.


THEOREM 37.17. For all locally observable games, opt* (ņ) < 9k? max(1, vj...) for 
all n < 1/(2k? max(1, vioc)). 


The proofs follow in subsequent sections. Combining Theorem 37.15 with
Theorem 37.17 shows that for an appropriately tuned learning rate, the regret of
Algorithm 26 on locally observable games is bounded by

\[
O\left( v_{\mathrm{loc}} k^{3/2} \sqrt{n \log(k)} \right).
\]

By using Theorem 37.16, it follows that for globally observable games the regret
is bounded by

\[
O\left( (v_{\mathrm{glo}} k n)^{2/3} (\log(k))^{1/3} \right).
\]

These results establish the upper bounds in the classification theorem for locally
and globally observable games. The quantities $v_{\mathrm{glo}}$ and $v_{\mathrm{loc}}$ only depend on $G$ but
may be exponentially large in $d$. We walk you through the proof of the following
proposition in Exercise 37.12.


PROPOSITION 37.18. The following hold:

(a) If $G$ is globally observable, then $v_{\mathrm{glo}} \le d^{1/2} k^{d/2}$.
(b) If $G$ is locally observable, then $v_{\mathrm{loc}} \le d^{1/2} k^{d/2}$.
(c) If $G$ is locally observable and non-degenerate, then $v_{\mathrm{loc}} \le m$.


The only property of non-degenerate games used in Part (c) is that $|\mathcal{N}_e| = 2$
for all $e \in E$. It is illustrative to bound $\operatorname{opt}^*(\eta)$ for well-known games. The next
proposition shows that Algorithm 26 recovers the usual bounds for bandits and
the full information setting.


PROPOSITION 37.19. The following hold:

(a) For bandit games ($\Phi = \mathcal{L}$), $\operatorname{opt}^*(\eta) \le k/2$.
(b) For full information games ($\Phi_{ai} = i$ for all $a$ and $i$), $\operatorname{opt}^*(\eta) \le 1/2$.


You will prove this proposition in Exercise 37.14 by making explicit choices of
$p \in \operatorname{ri}(\mathcal{P}_{k-1})$ and $f \in \mathcal{E}^{\mathrm{vec}}$.
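To get a feel for part (a), the objective inside the infimum of Eq. (37.12) can be evaluated numerically for one explicit choice of $p$ and $f$. The sketch below is not a proof and is not the solution of the exercise: it assumes the two-armed binary bandit game, takes $p = q$ in the relative interior, and uses the importance-weighting choice $f(a, \sigma) = \sigma e_a$, which is one natural candidate. The names `psi`, `objective` and `f_bandit` are hypothetical.

```python
import math


def psi(q, z):
    """Psi_q(z) = <q, exp(-z) + z - 1>."""
    return sum(qa * (math.exp(-za) + za - 1.0) for qa, za in zip(q, z))


def objective(L, Phi, f, p, q, eta):
    """Worst case over outcomes i of the bracketed term in Eq. (37.12)."""
    k, d = len(L), len(L[0])
    worst = -math.inf
    for i in range(d):
        regret = sum((p[a] - q[a]) * L[a][i] for a in range(k)) / eta
        stability = sum(
            p[a] * psi(q, [eta * x / p[a] for x in f(a, Phi[a][i])])
            for a in range(k)
        ) / eta ** 2
        worst = max(worst, regret + stability)
    return worst


# Two-armed bandit with binary losses: Phi = L.
L = [[0, 1, 0, 1], [0, 0, 1, 1]]
Phi = L
k = len(L)


def f_bandit(a, sigma):
    """Assumed estimation function: the observed loss placed in coordinate a."""
    return [float(sigma) if b == a else 0.0 for b in range(k)]


q = [0.3, 0.7]
value = objective(L, Phi, f_bandit, p=q, q=q, eta=0.1)
print(value, k / 2)
```

With $p = q$ the first term vanishes, and the bound $\exp(-x) + x - 1 \le x^2/2$ for $x \ge 0$ shows that the remaining term is at most $\sum_a \mathcal{L}_{ai}^2 / 2 \le k/2$, which the printed value respects.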
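Proof of Theorem 37.16


Global (and local) observability is defined in terms of the existence
of functions serving as unbiased loss estimators between pairs of neighbouring
actions. To make a connection between $\mathcal{E}^{\mathrm{vec}}$ and $\mathcal{E}^{\mathrm{glo}}$ (and $\mathcal{E}^{\mathrm{loc}}$) we need the
concept of an in-tree on the neighbourhood graph. Let $S$ be a subset of Pareto
optimal actions with no duplicate actions and $\cup_{a \in S} C_a = \mathcal{P}_{d-1}$. An in-tree on the
graph $(S, E)$ is a set of edges $\mathcal{T} \subseteq E$ such that $(S, \mathcal{T})$ is a directed tree with all
edges pointing towards a special vertex called the root, denoted by $\operatorname{root}_{\mathcal{T}}$, and
such that $V(\mathcal{T})$, the set of vertices underlying $\mathcal{T}$, is the same as $S$. Provided
the game is non-trivial, such a tree exists by Lemma 37.7. Given a Pareto
optimal action $b$, let $\operatorname{path}_{\mathcal{T}}(b) \subseteq \mathcal{T}$ denote the path from $b$ to the root. The path
is empty when $b$ is the root. When $b$ is not the root, we let $\operatorname{par}_{\mathcal{T}}(b)$ denote the
unique Pareto optimal action such that $(b, \operatorname{par}_{\mathcal{T}}(b)) \in \mathcal{T}$.

Abbreviate $v = v_{\mathrm{glo}}$ and let $\mathcal{T} \subseteq E$ be an arbitrary in-tree over the Pareto
optimal actions. For each $e \in E$, let $f_e \in \mathcal{E}^{\mathrm{glo}}_e$ be such that $\|f_e\|_\infty \le v$. Then
define $f : [k] \times \Sigma \to \mathbb{R}^k$ by

\[
f(a, \sigma)_b = \sum_{e \in \operatorname{path}_{\mathcal{T}}(b)} f_e(a, \sigma).
\]

By the triangle inequality, $\max_{a \in [k], \sigma \in \Sigma} \|f(a, \sigma)\|_\infty \le kv$. Furthermore, $f \in \mathcal{E}^{\mathrm{vec}}$,
since for any outcome $i$,

\[
\sum_{a=1}^k f(a, \Phi_{ai})_b = \sum_{a=1}^k \sum_{e \in \operatorname{path}_{\mathcal{T}}(b)} f_e(a, \Phi_{ai}) = \mathcal{L}_{bi} - \mathcal{L}_{\operatorname{root}_{\mathcal{T}} i}.
\]

Let $p = (1 - \gamma) q + \gamma \mathbf{1}/k$ with $\gamma = v k^2 \sqrt{\eta}$. By the condition in the theorem that
$\eta \le 1 / \max\{1, v^2 k^4\}$, it holds that $\gamma \le 1$ and hence $p \in \operatorname{ri}(\mathcal{P}_{k-1})$. The next step
is to bound the minimum possible value of the loss estimator. For actions $a$ and
$b$ and outcome $i$,

\[
\frac{\eta f(a, \Phi_{ai})_b}{p_a} \ge -\frac{\eta v k^2}{\gamma} = -\sqrt{\eta} \ge -1,
\]

where in the final inequality we used the fact that $\eta \le 1$. Next, using the fact
that $\exp(-x) \le x^2 + 1 - x$ for $x \ge -1$, it follows that for any $z \ge -\mathbf{1}$,

\[
\Psi_q(z) \le \sum_{b=1}^k q_b z_b^2, \tag{37.15}
\]

which is the inequality we have used long ago in Chapter 11. Using this,

\[
\frac{1}{\eta^2} \sum_{a=1}^k p_a \Psi_q\left( \frac{\eta f(a, \Phi_{ai})}{p_a} \right)
\le \sum_{a=1}^k \frac{1}{p_a} \sum_{b=1}^k q_b f(a, \Phi_{ai})_b^2 \le \frac{v^2 k^4}{\gamma} = \frac{v k^2}{\sqrt{\eta}},
\]

where we used that $\|f(a, \sigma)\|_\infty \le kv$ and $p \ge \gamma \mathbf{1}/k$ and that $q \in \mathcal{P}_{k-1}$. For the
other component of the objective,

\[
\frac{1}{\eta} (p - q)^\top \mathcal{L} e_i = \frac{\gamma}{\eta} (\mathbf{1}/k - q)^\top \mathcal{L} e_i \le \frac{\gamma}{\eta} = \frac{v k^2}{\sqrt{\eta}}.
\]

Combining the previous two displays shows that for any $i \in [d]$,

\[
\frac{1}{\eta} (p - q)^\top \mathcal{L} e_i + \frac{1}{\eta^2} \sum_{a=1}^k p_a \Psi_q\left( \frac{\eta f(a, \Phi_{ai})}{p_a} \right) \le \frac{2 v k^2}{\sqrt{\eta}},
\]

which is the desired result.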


Proof of Theorem 37.17 


Exploiting local observability is not straightforward. To gain some insight let us 
consider the matching pennies game with c = 1/4: 


o_o" 
0 1 
1 ő 2 
L= 1 0 = 
1/4 1/4 H T 


The figure on the right-hand side is the neighbourhood graph. Notice that the third 
action is revealing and also separates the first two actions in the neighbourhood 
graph. Clearly, loss differences can be estimated between all pairs of neighbours 
in this graph, and hence the game is locally observable. Let’s suppose now that 
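$q = (1/2 - \varepsilon/2, 1/2 - \varepsilon/2, \varepsilon)$ and $p = q$. The obvious estimation function $f \in \mathcal{E}^{\mathrm{vec}}$
is given by

$$f(a, \sigma) = \begin{cases} (0, 1, 1/4)^\top , & \text{if } a = 3 \text{ and } \sigma = 1 ; \\ (1, 0, 1/4)^\top , & \text{if } a = 3 \text{ and } \sigma = 2 ; \\ \mathbf{0} , & \text{otherwise} . \end{cases}$$

Examining the second term in Eq. (37.12) and using a second order Taylor
approximation,

$$\frac{1}{\eta^2} \sum_{a=1}^3 p_a \Psi_q\!\left(\frac{\eta f(a, \Phi_{ai})}{p_a}\right) \approx \frac{1}{2} \sum_{a=1}^3 \frac{1}{p_a} \sum_{b=1}^3 q_b f(a, \Phi_{ai})_b^2 = \frac{1}{32} + \frac{1 - \varepsilon}{4 \varepsilon} ,$$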


which holds for both $i = 1$ and $i = 2$. This is bad news. The appearance of $p_3 = q_3$
in the denominator means the objective can be arbitrarily large when $\varepsilon$ is small.
Taylor’s theorem shows that the approximation is not to blame, provided that
$\eta$ is suitably small. The main issue is that $q$ and $p$ assign most of their mass to
two actions that are not neighbours and hence cannot be distinguished without
playing a third action. Now suppose that $p$ is constructed by transferring mass
from the first two actions to the third by:

$$p = q - \min(q_1, q_2)(e_1 + e_2) + 2 \min(q_1, q_2) e_3 .$$

The first observation is that this can only decrease the expected loss:

$$(p - q)^\top \mathcal{L} = -\frac{1}{2} \min(q_1, q_2) \mathbf{1}^\top \le 0 .$$


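This takes care of the first term in the objective. Let us assume without loss of
generality that $p_1 = \max(p_1, p_2, p_3)$, and let

$$f(a, \sigma) = \begin{cases} (0, 1, 1/4)^\top , & \text{if } a = 3 \text{ and } \sigma = 1 ; \\ (0, -1, -3/4)^\top , & \text{if } a = 3 \text{ and } \sigma = 2 ; \\ \mathbf{0} , & \text{otherwise} . \end{cases}$$

Using again a Taylor approximation suggests the second term in the objective is
now well behaved:

$$\frac{1}{\eta^2} \sum_{a=1}^3 p_a \Psi_q\!\left(\frac{\eta f(a, \Phi_{ai})}{p_a}\right) \approx \frac{1}{2} \sum_{a=1}^3 \frac{1}{p_a} \sum_{b=1}^3 q_b f(a, \Phi_{ai})_b^2 \le \frac{1}{2} \left( \frac{q_2}{p_3} + \frac{q_3}{p_3} \right) \le \frac{1}{2} \left( \frac{1}{2} + 1 \right) .$$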


Things are starting to look more promising. By transferring the mass in q towards 
the revealing action and shifting the loss estimators to be zero on the most played 
action, we have gained control of the stability term and simultaneously decreased 
the expected loss of p relative to q. 
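
The matching pennies calculation can be replayed in exact arithmetic. The sketch below assumes the loss matrix with rows $(0, 1)$, $(1, 0)$ and $(1/4, 1/4)$, the symmetric choice $q = (1/2 - \varepsilon/2, 1/2 - \varepsilon/2, \varepsilon)$, and the second-order approximation $\frac{1}{2} \sum_b q_b f_b^2 / p_3$ of the stability term. All function names are made up for the illustration.

```python
from fractions import Fraction as F

# Loss matrix of matching pennies with c = 1/4 (rows: actions, columns: outcomes).
LOSS = [[F(0), F(1)], [F(1), F(0)], [F(1, 4), F(1, 4)]]

def transfer_mass(q):
    """Move min(q1, q2) from each of the first two actions onto the third."""
    m = min(q[0], q[1])
    return [q[0] - m, q[1] - m, q[2] + 2 * m]

def loss_change(p, q):
    """Entries of (p - q)^T L, one per outcome."""
    return [sum((p[a] - q[a]) * LOSS[a][i] for a in range(3)) for i in range(2)]

def quadratic_term(p3, q, f_vec):
    """(1/2) * sum_b q_b f_b^2 / p_3; only the revealing action 3 contributes."""
    return F(1, 2) * sum(qb * fb * fb for qb, fb in zip(q, f_vec)) / p3

eps = F(1, 1000)
q = [F(1, 2) - eps / 2, F(1, 2) - eps / 2, eps]
p = transfer_mass(q)

# Playing p = q with the unshifted estimator: blows up as eps shrinks.
naive = quadratic_term(q[2], q, [F(0), F(1), F(1, 4)])
# Playing the transferred p with estimators shifted to vanish on action 1.
shifted = max(quadratic_term(p[2], q, [F(0), F(1), F(1, 4)]),
              quadratic_term(p[2], q, [F(0), F(-1), F(-3, 4)]))

print("p =", p)
print("(p - q)^T L =", loss_change(p, q))
print("naive term =", float(naive), "shifted term =", float(shifted))
```

With these inputs the naive term grows like $1/(4\varepsilon)$, while the shifted term stays below $3/4$ regardless of $\varepsilon$.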


Duality and the Water Transfer Operator 


The water transfer operator, which we will introduce momentarily, provides the 
generalisation of the specific argument just given. The first step is an application 
of Sion’s minimax theorem (Theorem 28.12) to Eq. (37.12), which shows that 


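$$\operatorname{opt}_q(\eta) = \max_{\lambda \in \mathcal{P}_{d-1}} \inf_{\substack{f \in \mathcal{E}^{\mathrm{vec}} \\ p \in \mathcal{P}_{k-1}}} \left[ \frac{1}{\eta} (p - q)^\top \mathcal{L} \lambda + \frac{1}{\eta^2} \sum_{i=1}^d \lambda_i \sum_{a=1}^k p_a \Psi_q\!\left(\frac{\eta f(a, \Phi_{ai})}{p_a}\right) \right] . \tag{37.16}$$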


By exchanging the max and the inf, we free ourselves from finding a distribution 
p and estimation function f such that the objective is controlled for all choices 
of the adversary. Now we only need to find a p and f for each distribution over 
outcomes $\lambda \in \mathcal{P}_{d-1}$.

Fix therefore an arbitrary distribution $\lambda \in \mathcal{P}_{d-1}$. Let $S$ be an arbitrary subset
of Pareto optimal actions containing no duplicates and for which $\bigcup_{a \in S} C_a = \mathcal{P}_{d-1}$
and let $\mathcal{T} \subset E$ be an in-tree over $S$. Given an edge $e = (a, b) \in E$, let
$\alpha_e : \mathcal{N}_e \to [0, 1]$ be the mapping such that $\ell_c = (1 - \alpha_e(c)) \ell_a + \alpha_e(c) \ell_b$ for
all $c \in \mathcal{N}_e$, which exists by Lemma 37.8. Note that $\alpha_e(c) = 0$ when $c$ is a
duplicate of $a$ and $\alpha_e(c) = 1$ when it is a duplicate of $b$. A vector $y \in \mathbb{R}^k$ is called
$\mathcal{T}$-increasing if for all $e = (a, b) \in \mathcal{T}$ and $c, d \in \mathcal{N}_e$ with $\alpha_e(d) \ge \alpha_e(c)$, it holds
that $y_d \ge y_c$. A vector $y$ is called $\mathcal{T}$-decreasing if $-y$ is $\mathcal{T}$-increasing. This concept
is illustrated in Fig. 37.5.

[Figure: in-tree diagram with the root and the degenerate actions labelled.]

Figure 37.5 The large nodes are Pareto optimal actions in $S$. The smaller nodes inside
are their duplicates, which are not part of $S$. The remaining nodes are degenerate
actions that are linear combinations of Pareto optimal actions. The arrows indicate
the in-tree. A vector $y \in \mathbb{R}^k$ is $\mathcal{T}$-increasing if it is constant on duplicate actions and
otherwise increasing in the direction of the arrows. In this case, the constraint is that
$y_1 = y_2 = y_3 \le y_4 \le y_5 = y_6 = y_7 \le y_8 \le y_9 \le y_{12}$ and $y_{10} \le y_{11} \le y_{12}$.


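LEMMA 37.20. Given an in-tree $\mathcal{T} \subset E$ and distribution $q \in \mathcal{P}_{k-1}$ there exists a
distribution $r \in \mathcal{P}_{k-1}$ such that

(a) $r \ge q/k$;
(b) $r$ is $\mathcal{T}$-increasing; and
(c) $\langle r - q, y \rangle \le 0$ for all $\mathcal{T}$-decreasing vectors $y \in \mathbb{R}^k$.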


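Proof For simplicity, we give the proof for the special case that all actions
are Pareto optimal and there are no duplicates, in which case $S = [k]$.
The proof is generalised in Exercise 37.11. Given an action $a \in [k]$, let
$\operatorname{anc}_{\mathcal{T}}(a) = \bigcup_{e \in \operatorname{path}_{\mathcal{T}}(a)} \mathcal{N}_e \cup \{a\}$ be the set of ancestors of $a$, including $a$, and
$\operatorname{desc}_{\mathcal{T}}(a) = \{b : a \in \operatorname{anc}_{\mathcal{T}}(b)\}$ be the set of descendants of $a$. Define $r$ by

$$r_a = \sum_{b \in \operatorname{desc}_{\mathcal{T}}(a)} \frac{q_b}{|\operatorname{anc}_{\mathcal{T}}(b)|} .$$

Let us first confirm that $r \in \mathcal{P}_{k-1}$. That $r \ge 0$ is obvious and

$$\sum_{a=1}^k r_a = \sum_{a=1}^k \sum_{b \in \operatorname{desc}_{\mathcal{T}}(a)} \frac{q_b}{|\operatorname{anc}_{\mathcal{T}}(b)|} = \sum_{b=1}^k \sum_{a \in \operatorname{anc}_{\mathcal{T}}(b)} \frac{q_b}{|\operatorname{anc}_{\mathcal{T}}(b)|} = 1 .$$

For Part (a), the definition means that $r_a \ge q_a/|\operatorname{anc}_{\mathcal{T}}(a)| \ge q_a/k$. That $r$ is
$\mathcal{T}$-increasing follows immediately from the definition. Part (c) follows because

$$\langle r, y \rangle = \sum_{a=1}^k y_a \sum_{b \in \operatorname{desc}_{\mathcal{T}}(a)} \frac{q_b}{|\operatorname{anc}_{\mathcal{T}}(b)|} \le \sum_{a=1}^k \sum_{b \in \operatorname{desc}_{\mathcal{T}}(a)} \frac{y_b q_b}{|\operatorname{anc}_{\mathcal{T}}(b)|} = \langle q, y \rangle .$$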


The existence of the mapping $q \mapsto r$ given by Lemma 37.20 was originally proven
using a ‘water flowing’ argument and was called the water transfer operator. 
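
For a tree with no duplicate or degenerate actions, the water transfer operator is easy to compute directly: each action $b$ splits its mass $q_b$ equally among itself and its ancestors, so that $r_a = \sum_{b \in \operatorname{desc}(a)} q_b / |\operatorname{anc}(b)|$. The following sketch implements this special case only. The tree, the distribution `q` and the function names are hypothetical and chosen for illustration.

```python
from fractions import Fraction as F

def ancestors(parent, a):
    """Return the set containing a and every vertex on the path from a to the root."""
    out = {a}
    while parent[a] is not None:
        a = parent[a]
        out.add(a)
    return out

def water_transfer(parent, q):
    """r_a = sum over b with a in anc(b) of q_b / |anc(b)| (no duplicates or degenerate actions)."""
    anc = {b: ancestors(parent, b) for b in parent}
    r = {a: F(0) for a in parent}
    for b, qb in q.items():
        share = qb / len(anc[b])
        for a in anc[b]:
            r[a] += share
    return r

# Hypothetical in-tree on five actions: edges point from child to parent, 0 is the root.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 3}
q = {0: F(1, 10), 1: F(1, 10), 2: F(2, 10), 3: F(1, 10), 4: F(5, 10)}
r = water_transfer(parent, q)
k = len(parent)

# A T-decreasing vector gets smaller along the arrows; the depth of each vertex is one example.
y = {a: len(ancestors(parent, a)) for a in parent}
gap = sum((r[a] - q[a]) * y[a] for a in parent)
print("r =", r, "<r - q, y> =", gap)
```

Mass flows towards the root, which is why $r$ is increasing along the arrows while never falling below $q/k$.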


LEMMA 37.21. Let $S$ be as before. Then, for any $\lambda \in \mathcal{P}_{d-1}$, there exists an in-tree
$\mathcal{T} \subset E$ over vertices $S$ such that $\mathcal{L}\lambda$ is $\mathcal{T}$-decreasing.


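Proof Again, we outline the argument for games with no degenerate or duplicate
actions, leaving the complete proof for Exercise 37.11. Let $a$ be an action such
that $\lambda \in C_a$. First, assume that $\lambda \in \operatorname{ri}(C_a)$. The root of our tree will be
$a$ (the reader may find it helpful to check Fig. 37.6). Next, for $b \ne a$, define
$\operatorname{par}(b) = \operatorname{argmin}_{d \in \mathcal{N}_b} e_d^\top \mathcal{L}\lambda$, where $\mathcal{N}_b$ is the set of neighbours of $b$, and then let
$\mathcal{T} = \{(b, \operatorname{par}(b)) : b \ne a\}$. Clearly,
$V(\mathcal{T}) = [k]$. Provided that $\mathcal{T}$ really is a tree, the fact that $\mathcal{L}\lambda$ is $\mathcal{T}$-decreasing is
obvious from the definition of the parent function. That $\mathcal{T}$ is a tree follows by
showing that for any $(b, d) \in \mathcal{T}$, $e_d^\top \mathcal{L}\lambda < e_b^\top \mathcal{L}\lambda$, which we will prove now. For this,
let $\omega \in \operatorname{ri}(C_b)$ and $c \in \mathcal{N}_b$ be such that $C_c \cap [\omega, \lambda] \ne \emptyset$. These exist by Exercise 37.10
(see also Fig. 37.6). We now show that $e_c^\top \mathcal{L}\lambda < e_b^\top \mathcal{L}\lambda$ from which the desired
result follows. To show this let

$$f(\alpha) = (e_b - e_c)^\top \mathcal{L}((1 - \alpha)\omega + \alpha\lambda) .$$

It suffices to show that $f(1) > 0$. The following hold: (a) $f$ is linear; (b) $f(0) < 0$
since $\omega \in \operatorname{ri}(C_b)$; and (c) there exists an $\alpha \in (0, 1)$ such that $f(\alpha) = 0$, which
holds because $C_c \cap [\omega, \lambda] \ne \emptyset$ and $\lambda \in \operatorname{ri}(C_a)$. Thus $f(1) > 0$, establishing the
result for $\lambda \in \operatorname{ri}(C_a)$. When $\lambda$ is on the boundary of $C_a$, let $(\lambda^{(i)})_{i=1}^\infty$ be a sequence
in $\operatorname{ri}(C_a)$ so that $\lim_{i \to \infty} \lambda^{(i)} = \lambda$. For each $i$, let $\mathcal{T}^{(i)} \subset E$ be an in-tree such
that $\mathcal{L}\lambda^{(i)}$ is $\mathcal{T}^{(i)}$-decreasing. Since there are only finitely many trees, by selecting
a subsequence we conclude that there exists an in-tree $\mathcal{T} \subset E$ such that $\mathcal{L}\lambda^{(i)}$ is
$\mathcal{T}$-decreasing for all $i$. The result follows by taking the limit.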
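
The construction in the proof is a one-line rule: the root is an action with the smallest expected loss under $\lambda$, and every other action points to its neighbour with the smallest expected loss. A minimal sketch follows, assuming the neighbour lists and the vector of expected losses $\mathcal{L}\lambda$ are already available. The numbers below are made up and chosen so that every non-root action has a strictly better neighbour, which is the property the proof establishes for actual games.

```python
def build_in_tree(neighbours, expected_loss):
    """Return (root, parent), where each non-root action points to its best neighbour."""
    root = min(expected_loss, key=expected_loss.get)
    parent = {root: None}
    for b in neighbours:
        if b == root:
            continue
        d = min(neighbours[b], key=expected_loss.get)
        if expected_loss[d] >= expected_loss[b]:
            raise ValueError(f"action {b} has no neighbour with strictly smaller loss")
        parent[b] = d
    return root, parent

def path_to_root(parent, b):
    """Vertices visited when following parent pointers from b to the root."""
    path = [b]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

# Hypothetical game with four actions arranged on a line.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
expected_loss = {0: 0.5, 1: 0.2, 2: 0.3, 3: 0.7}
root, parent = build_in_tree(neighbours, expected_loss)
print(root, parent, path_to_root(parent, 3))
```

Because the expected loss strictly decreases along every parent pointer, following the pointers can never cycle and must end at the root.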


This concludes the building of the tools needed to control $\operatorname{opt}_q(\eta)$ for locally
observable games.


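Proof of Theorem 37.17 Abbreviate $v = v_{\mathrm{loc}}$ and let $\lambda \in \mathcal{P}_{d-1}$ be arbitrary. By
Lemma 37.21, there exists an in-tree $\mathcal{T} \subset E$ over $S$ such that $\mathcal{L}\lambda$ is $\mathcal{T}$-decreasing.
Hence, by Lemma 37.20, there exists a $\mathcal{T}$-increasing $r \in \mathcal{P}_{k-1}$ such that $r \ge q/k$
and $(r - q)^\top \mathcal{L}\lambda \le 0$. Let $p = (1 - \gamma) r + \gamma \mathbf{1}/k$ with $\gamma = \eta v k^2$ and

$$f(a, \sigma)_b = \sum_{e \in \operatorname{path}_{\mathcal{T}}(b)} f_e(a, \sigma) ,$$

where $f_e \in \mathcal{E}^{\mathrm{loc}}_e$ has $\|f_e\|_\infty \le v$. The same argument as in the proof of
Theorem 37.16 shows that $f \in \mathcal{E}^{\mathrm{vec}}$. Moving to the objective in Eq. (37.16), we
lower-bound the loss estimates:

$$\frac{\eta f(a, \sigma)_b}{p_a} = \frac{\eta}{p_a} \sum_{e \in \operatorname{path}_{\mathcal{T}}(b)} f_e(a, \sigma) \ge -\frac{\eta k^2 v}{\gamma} = -1 . \tag{37.17}$$

Figure 37.6 The core argument used in the proof of Lemma 37.21.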


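Fix $i \in [d]$. The stability term is bounded using the properties of $p$ and $f$ as
follows:

$$\begin{aligned}
\frac{1}{\eta^2} \sum_{a=1}^k p_a \Psi_q\!\left(\frac{\eta f(a, \Phi_{ai})}{p_a}\right)
&\le \sum_{a=1}^k \frac{1}{p_a} \sum_{b=1}^k q_b f(a, \Phi_{ai})_b^2 \\
&= \sum_{a=1}^k \frac{1}{p_a} \sum_{b=1}^k q_b \Bigg( \sum_{e \in \operatorname{path}_{\mathcal{T}}(b)} f_e(a, \Phi_{ai}) \Bigg)^2 \\
&\le 2 v^2 \sum_{b=1}^k q_b \sum_{a=1}^k \frac{1}{r_a} \Bigg( \sum_{e \in \operatorname{path}_{\mathcal{T}}(b)} \mathbb{I}\{a \in \mathcal{N}_e\} \Bigg)^2 \\
&\le 8 v^2 \sum_{b=1}^k q_b \sum_{a=1}^k \frac{1}{r_a} \mathbb{I}\Big\{a \in \textstyle\bigcup_{e \in \operatorname{path}_{\mathcal{T}}(b)} \mathcal{N}_e\Big\} \\
&\le 8 v^2 \sum_{b=1}^k \frac{q_b}{r_b} \sum_{a=1}^k \mathbb{I}\Big\{a \in \textstyle\bigcup_{e \in \operatorname{path}_{\mathcal{T}}(b)} \mathcal{N}_e\Big\} \\
&\le 8 v^2 k^3 .
\end{aligned}$$

Here, in the first inequality we used Eq. (37.15) and Eq. (37.17). The second
inequality follows by the definition of $v = v_{\mathrm{loc}}$ and the choice of $f_e \in \mathcal{E}^{\mathrm{loc}}_e$ and
also because $p_a \ge r_a/2$ by the condition on $\eta$ in the theorem statement. The
third holds since any action $a$ is in $\mathcal{N}_e$ for at most two edges $e \in \operatorname{path}_{\mathcal{T}}(b)$ (because
$V(\mathcal{T}) \subset \Pi$ and it has no duplicates). The fourth inequality is true since $r$ is
$\mathcal{T}$-increasing and the fifth because $r \ge q/k$. Finally, by Part (c) of Lemma 37.20
and the fact that $\mathcal{L}\lambda$ is $\mathcal{T}$-decreasing,

$$\frac{1}{\eta} (p - q)^\top \mathcal{L}\lambda = \frac{1 - \gamma}{\eta} (r - q)^\top \mathcal{L}\lambda + \frac{\gamma}{\eta} (\mathbf{1}/k - q)^\top \mathcal{L}\lambda \le k^2 v \le k^3 \max(1, v^2) .$$

Combining the previous two displays shows that

$$\frac{1}{\eta} (p - q)^\top \mathcal{L}\lambda + \frac{1}{\eta^2} \sum_{i=1}^d \lambda_i \sum_{a=1}^k p_a \Psi_q\!\left(\frac{\eta f(a, \Phi_{ai})}{p_a}\right) \le 9 k^3 \max(1, v^2) .$$

Since the right-hand side is independent of $\lambda$, the result follows from
Eq. (37.16).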


Proof of the Classification Theorem 


Almost all the results are now available to prove Theorem 37.11. In Section 37.4,
we showed that if $G$ is globally observable and not locally observable, then
$R^*_n(G) = \Omega(n^{2/3})$. We also proved that if $G$ is locally observable and has
neighbours, then $R^*_n(G) = \Omega(\sqrt{n})$. This last result is complemented by the policy
and analysis in Sections 37.5 to 37.7, where we showed that for globally observable
games $R^*_n(G) = O(n^{2/3})$ and for locally observable games $R^*_n(G) = O(\sqrt{n})$.
Finally we proved that if $G$ is not globally observable, then $R^*_n(G) = \Omega(n)$. All
that remains is to prove that if $G$ has no neighbouring actions, then $R^*_n(G) = 0$.


THEOREM 37.22. If $G$ has no neighbouring actions, then $R^*_n(G) = 0$.


Proof Since $G$ has no neighbouring actions, there exists an action $a$ such that
$C_a = \mathcal{P}_{d-1}$ and the policy that chooses $A_t = a$ for all rounds suffers no regret.


Notes 


1 The next three notes cover some basic definitions and facts in linear
algebra. There are probably hundreds of introductory texts on linear algebra.
A short and intuitive exposition is by Axler [1997].

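2 A non-empty set $L \subset \mathbb{R}^n$ is a linear subspace of $\mathbb{R}^n$ if $\alpha v + \beta w \in L$ for
all $\alpha, \beta \in \mathbb{R}$ and $v, w \in L$. If $L$ and $M$ are linear subspaces of $\mathbb{R}^n$, then
$L \oplus M = \{v + w : v \in L, w \in M\}$. The orthogonal complement of linear
subspace $L$ is $L^\perp = \{v \in \mathbb{R}^n : \langle u, v \rangle = 0 \text{ for all } u \in L\}$. The following
properties are easily checked: (i) $L^\perp$ is a linear subspace, (ii) $(L^\perp)^\perp = L$ and
(iii) $(L \cap M)^\perp = L^\perp \oplus M^\perp$.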

3 Let $A \in \mathbb{R}^{m \times n}$ be a matrix and recall that matrices of this form correspond
to linear maps from $\mathbb{R}^n \to \mathbb{R}^m$ where the function $A : \mathbb{R}^n \to \mathbb{R}^m$ is given by
matrix multiplication, $A(x) = Ax$. The image of $A$ is $\operatorname{im}(A) = \{Ax : x \in \mathbb{R}^n\}$,
and the kernel is $\ker(A) = \{x \in \mathbb{R}^n : Ax = 0\}$. Notice that $\operatorname{im}(A) \subset \mathbb{R}^m$ and
$\ker(A) \subset \mathbb{R}^n$. One can easily check that $\operatorname{im}(A)$ and $\ker(A^\top)$ are linear subspaces,
and an elementary theorem in linear algebra says that $\operatorname{im}(A) \oplus \ker(A^\top) = \mathbb{R}^m$
for any matrix $A \in \mathbb{R}^{m \times n}$. Finally, if $u \in \operatorname{im}(A)$ and $v \in \ker(A^\top)$, then
$\langle u, v \rangle = 0$.


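4 Given a set $A \subset \mathbb{R}^d$, the affine hull is the set

$$\operatorname{aff}(A) = \left\{ \sum_{i=1}^j \alpha_i x_i : j > 0, \ \alpha \in \mathbb{R}^j, \ x_i \in A \text{ for all } i \in [j] \text{ and } \sum_{i=1}^j \alpha_i = 1 \right\} .$$

Its dimension is the smallest $m$ such that there exist vectors $v_1, \ldots, v_m \in \mathbb{R}^d$
such that $\operatorname{aff}(A) = x_0 + \operatorname{span}(v_1, \ldots, v_m)$ for any $x_0 \in A$.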

5 We introduced the stochastic variant of partial monitoring to prove our lower
bounds. Of course our upper bounds also apply to this setting, which means the
classification theorem also holds in the stochastic case. The interesting question
is to understand the problem-dependent regret, which for partial monitoring
problem $G = (\mathcal{L}, \Phi)$ is

$$R_n(\pi, u) = \max_{a \in [k]} \mathbb{E}\left[ \sum_{t=1}^n \langle \ell_{A_t} - \ell_a, U_t \rangle \right] ,$$

where $U, U_1, \ldots, U_n$ is a sequence of independent and identically distributed
random vectors with $U_t \in \{e_1, \ldots, e_d\}$ and $\mathbb{E}[U] = u \in \mathcal{P}_{d-1}$. Provided $G$ is
not hopeless, one can derive an algorithm for which the regret is logarithmic,
and like in bandits there is a sense of asymptotic optimality. The open research
question is to understand the in-between regime where the horizon is not yet
large enough that the asymptotically optimal logarithmic regret guarantees
become meaningful, but not so small that minimax is acceptable.

6 More generally, a stochastic partial monitoring problem is defined by a probability
kernel $(P_{\theta,a} : \theta \in \Theta, a \in \mathcal{A})$ from $(\Theta \times \mathcal{A}, \mathcal{F} \otimes \mathcal{G})$ to $(\Sigma \times \mathbb{R}, \mathcal{H} \otimes \mathfrak{B}(\mathbb{R}))$. The
environment chooses $\theta \in \Theta$, and the learner chooses $(A_t)_{t=1}^n$ with $A_t \in \mathcal{A}$ and
observes $(\sigma_t)_{t=1}^n$ in a sequential manner, where $(\sigma_t, X_t) \sim P_{\theta, A_t}(\cdot)$. The reward
$X_t$ of round $t$ is unobserved. As before, the learner’s goal is to maximise the
total expected reward or, equivalently, to minimise regret. The special case
of the previous note has been studied under the name of finite stochastic
partial monitoring.

7 The optimal tuning of the bound for Algorithm 26 depends on $\operatorname{opt}^*(\eta)$, which
may be hard to compute. A simple way to address this problem is to use an
adaptive learning rate:

$$\eta_t = \min\left\{ \frac{1}{B}, \ \sqrt{\frac{\log(k)}{1 + \sum_{s=1}^{t-1} V_s}} \right\} ,$$

where $V_t = \max\{0, \operatorname{opt}_{Q_t}(\eta_t)\}$ and $B$ is chosen large enough that $\eta_t$ is
sufficiently small to satisfy the conditions needed in either Theorem 37.16
or Theorem 37.17. An excessively large $B$ only affects the regret in an additive
fashion. The adaptive algorithm only needs to solve the optimisation in
Eq. (37.12) and not $\operatorname{opt}^*(\eta)$. Another benefit of the adaptive algorithm is
that it only depends on the game through the constant $B$. Furthermore, the
bound depends on $(V_t)_{t=1}^n$, rather than $\operatorname{opt}^*(\eta)$, which may sometimes be
beneficial. The analysis of the algorithm uses the same techniques as developed
in Exercise 28.13 and is given by Lattimore and Szepesvari [2019d]. 
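
The adaptive schedule is cheap to maintain online. Here is a minimal sketch, assuming the rule $\eta_t = \min\{1/B, \sqrt{\log(k) / (1 + \sum_{s<t} V_s)}\}$ with $V_s$ the positive part of the optimal value in round $s$. The list `opt_values` holds made-up numbers standing in for the outputs of the optimisation in Eq. (37.12). In the actual algorithm each value depends on the learning rate used in that round, which this sketch ignores.

```python
import math

def adaptive_learning_rate(k, B, past_values):
    """eta_t = min{1/B, sqrt(log(k) / (1 + sum of max(0, V_s) over earlier rounds))}."""
    total = sum(max(0.0, v) for v in past_values)
    return min(1.0 / B, math.sqrt(math.log(k) / (1.0 + total)))

k, B = 4, 2.0
opt_values = [3.0, -0.5, 8.0, 2.0, 5.0]  # hypothetical per-round optimisation values
rates = [adaptive_learning_rate(k, B, opt_values[:t]) for t in range(len(opt_values) + 1)]
print(rates)
```

The cap $1/B$ is active in the early rounds, after which the rate decays as the accumulated values grow.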

8 Algorithm 26 can be modified in several ways. One enhancement is to drop the
constraint that $f \in \mathcal{E}^{\mathrm{vec}}$ in the optimisation problem and introduce the worst
case bias of f as a penalty. Certainly this does not make the bounds worse. A 
more significant change is to introduce a moment-generating function into the 
optimisation problem, which leads to high-probability bounds [Lattimore and 
Szepesvari, 2019d]. 

9 The optimisation problem in Algorithm 26 is convex and can be solved using
standard solvers when $k$ and $d$ are small and $\eta$ is not too small. When $\eta$ is small
and/or $k$ or $d$ is large, then numerical instability is a real challenge. One way to
address this issue is to approximate the exponential in the definition of $\Psi_q$ with
a quadratic and add constraints on p and f that ensure the approximation is 
reasonable. Since the analysis uses p and f satisfying these conditions, none of 
the theory changes. What is bought by this approximation is that the resulting 
optimisation problem becomes a second order cone program, rather than an 
exponential cone program, and these are better behaved. More details are in 
our paper: [Lattimore and Szepesvari, 2019d]. 

10 Partial monitoring has many potential applications. We already mentioned
dynamic pricing and spam filtering. In the latter case, acquiring the true label 
comes at a price, which is a typical component of hard partial monitoring 
problems. In general, there are many set-ups where the learner can pay extra 
for high-quality information. For example, in medical diagnosis the doctor can 
request additional tests before recommending a treatment plan, but these cost 
time and money. Yet another potential application is quality testing in factory 
production where the quality control team can choose which items to test (at 
great cost). 

11 There are many possible extensions to the partial monitoring framework. We
have only discussed problems where the number of actions/feedbacks/outcomes
is finite, but nothing prevents studying a more general setting.
Suppose the adversary chooses a sequence of real-valued outcomes $i_1, \ldots, i_n$ with
$i_t \in [0, 1]$. In each round, the learner chooses $A_t \in [k]$ and observes $\Phi_{A_t}(i_t)$,
where $\Phi_a : [0, 1] \to \Sigma$ is a known feedback function. The loss is determined
by a collection of known functions $\mathcal{L}_a : [0, 1] \to [0, 1]$. We do not know of any
systematic study of this setting. The reader can no doubt imagine generalising
this idea to infinite action sets or introducing a linear structure for the loss.

12 A pair of Pareto-optimal actions $(a, b)$ are called weak neighbours if
$C_a \cap C_b \ne \emptyset$ and pairwise observable if there exists a function $g$ satisfying
Eq. (37.3) and with $g(c, \sigma) = 0$ whenever $c \notin \{a, b\}$. A partial monitoring
problem is called a point-locally observable game if all weak neighbours are
pairwise observable. All point-locally observable games are locally observable,
but the converse is not true. Bartók [2013] designed a policy for this type of 


game for which


where $\varepsilon_G > 0$ is a game-dependent constant and $k_{\mathrm{loc}}$ is the size of the largest
$A \subset [k]$ of Pareto optimal actions such that $\bigcap_{a \in A} C_a \ne \emptyset$. Using a different
policy, Lattimore and Szepesvari [2019a] have shown that as the horizon grows,
the game-dependence diminishes so that

$$\limsup_{n \to \infty} \frac{R_n}{\sqrt{n}} \le 8(2 + m) \sqrt{2 k_{\mathrm{loc}} \log(k)} .$$


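13 Linear regret is unavoidable in hopeless games, but that does not mean there
is nothing to play for. Rustichini [1999] considered a version of the regret that
captures the performance of policies in this harsh setting. Given $p \in \mathcal{P}_{d-1}$
define set $\mathcal{I}(p) \subset \mathcal{P}_{d-1}$ by

$$\mathcal{I}(p) = \left\{ q \in \mathcal{P}_{d-1} : \sum_{i=1}^d (p_i - q_i) \mathbb{I}\{\Phi_{ai} = \sigma\} = 0 \text{ for all } a \in [k] \text{ and } \sigma \in \Sigma \right\} .$$

This is the set of distributions over the outcomes that are indistinguishable
from $p$ by the learner using any actions. Then define

$$f(p) = \max_{q \in \mathcal{I}(p)} \min_{a \in [k]} \sum_{i=1}^d q_i \mathcal{L}_{ai} .$$

Rustichini [1999] proved there exist policies such that

$$\lim_{n \to \infty} \max_{i_{1:n}} \mathbb{E}\left[ \frac{1}{n} \sum_{t=1}^n \mathcal{L}_{A_t i_t} - f(\bar{u}_n) \right] = 0 ,$$

where $\bar{u}_n = \frac{1}{n} \sum_{t=1}^n e_{i_t} \in \mathcal{P}_{d-1}$ is the average outcome chosen by the adversary.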
Intuitively this means the learner does not compete with the best action in 
hindsight with respect to the actual outcomes. Instead, the learner competes 
with the best action in hindsight with respect to an outcome sequence that is 
indistinguishable from the actual outcome sequence. Rustichini did not prove 
rates on the convergence of the limit. This has been remedied recently, and we 
give some references in the bibliographic remarks. 


14 Partial monitoring is still quite poorly understood. With some exceptions, we
do not know how the regret should depend on d, k, m or the structure of 
G. Lower bounds that depend on these quantities are also missing, and the 
lower bounds proven in Section 37.4 are surely very conservative. We hope this 
chapter inspires more activity in this area. The setting described in Note 13 is 
even more wide open, where the dependence on n is still not nailed down. 
Bibliographical Remarks


The first work on partial monitoring is by Rustichini [1999], who focussed on 
finding Hannan consistent policies in the adversarial setting. Rustichini shows 
how to reduce the problem to Blackwell approachability (see Cesa-Bianchi and 
Lugosi [2006]) and uses this to deduce the existence of a Hannan consistent 
strategy. Rustichini also used a refined notion of regret that allows one to 
distinguish between learners even in the case of hopeless games (see Note 13). 
The first non-asymptotic result in the setting of this chapter is due to Piccolboni
and Schindelhauer [2001], who derive a policy with regret $O(n^{3/4})$ for globally
observable games. Cesa-Bianchi et al. [2006] reduced the dependence to $O(n^{2/3})$
and proved a wide range of other results for specific classes of problems. The first
$O(n^{1/2})$ bound for non-degenerate locally observable games is due to Foster and
Rakhlin [2012]. The classification theorem when $d = 2$ is due to Bartók et al.
[2010] (extended version: Antos et al. [2013]). With the exception of degenerate 
games, the classification of adversarial partial monitoring games is by Bartók 
et al. [2014]. The case of degenerate games was resolved by the present authors 
[Lattimore and Szepesvari, 2019a]. The policies mentioned in Note 12 are due to 
Bartók [2013] and Lattimore and Szepesvari [2019a]. We warn the reader that 
neighbours are defined differently by Foster and Rakhlin [2012] and Bartók [2013], 
which can lead to confusion. Additionally, although both papers are largely 
correct, in both cases the core proofs contain errors that cannot be resolved 
without changing the policies [Lattimore and Szepesvari, 2019a]. Algorithm 26 
and its analysis are also by the present authors [Lattimore and Szepesvari, 2019d],
which is a follow-up to an earlier information-theoretic analysis [Lattimore and
Szepesvari, 2019c]. 

There is a growing literature on the stochastic setting where it is common to 
study both minimax and asymptotic bounds. In the latter case, one can obtain 
asymptotically optimal logarithmic regret for games that are not hopeless. We 
refer the reader to papers by Bartók et al. [2012], Vanchinathan et al. [2014] 
and Komiyama et al. [2015b] as a good starting place. As we mentioned, partial 
monitoring can model problems that lie between bandits and full information. 
There are now several papers on this topic, but in more restricted settings and 
consequentially with more practical algorithms and bounds. One such model is 
when the learner is playing actions corresponding to vertices on a graph and 
observes the losses associated with the chosen vertex and its neighbours [Mannor 
and Shamir, 2011, Alon et al., 2013]. A related result is in the finite-armed 
Gaussian setting where the learner selects an action A; € [k] and observes a 
Gaussian sample from each arm, but with variances depending on the chosen 
action. Like partial monitoring, this problem exhibits many challenges and is 
not yet well understood [Wu et al., 2015]. We mentioned in Note 13 that for 
hopeless games, the definition of the regret can be refined. A number of authors 
have studied this setting and proved sublinear regret guarantees. As usual, the 
price of generality is that the bounds are correspondingly a bit worse [Mannor
and Shimkin, 2003, Perchet, 2011, Mannor et al., 2014]. There has been some
work on infinite partial monitoring games. Lin et al. [2014] study a stochastic
setting with finitely many actions, but infinitely many outcomes and a particular
linear structure for the feedback. Chaudhuri and Tewari [2016] also consider a
linear setting with global observability and prove $O(n^{2/3} \log(n))$ regret using
an explore-then-commit algorithm. Kirschner et al. [2020] study a version of
information-directed sampling in a partial monitoring setting with a linear feedback
structure and finitely or infinitely many actions. 

One can also add context, as usual. The special case of stochastic finite 
contextual partial monitoring has been considered by Bartók and Szepesvari 
[2012]. In this version, the learner is still given the matrices $(\mathcal{L}, \Phi)$, but also a set of
functions $\mathcal{F}$ that map a sequence $(x_t)_t$ of contexts to outcome distributions, with
the assumption that the outcome in round $t$ is generated from $f(x_t)$ with $f \in \mathcal{F}$
unknown to the learner. A special case, apple tasting with context (equivalently, 
matching pennies with context) is the subject of the paper of Helmbold et al. 
[2000]. The aforementioned paper by Kirschner et al. [2020] also studies the 
contextual partial monitoring problem in a linear setting. 


Exercises 


37.1 (PERSPECTIVE) Prove that the perspective as defined in Eq. (37.14) is 
convex. 


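37.2 (AFFINE SETS AND DIMENSION) Let $\mathcal{X} \subset \mathcal{Y} \subset \mathbb{R}^d$ and $\dim(\mathcal{X}) = \dim(\mathcal{Y})$.
Prove that $\operatorname{aff}(\mathcal{X}) = \operatorname{aff}(\mathcal{Y})$.


37.3 (MODIFIED KERNEL) Recall that $\ker'(x) = \{u : u^\top x = 0 \text{ and } u^\top \mathbf{1} = 1\}$.
Show that if $\ker'(x) = \ker'(y) \ne \emptyset$ then $x$ and $y$ are proportional.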


37.4 (STRUCTURE OF EXAMPLES) Calculate the neighbourhood structure, cell 
decomposition and action classification for each of the examples in this chapter. 


37.5 (APPLE TASTING) Apples arrive sequentially from the farm to a processing 
facility. Most apples are fine, but occasionally there is a rotten one. The only way 
to figure out whether an apple is good or rotten is to taste it. For some reason 
customers do not like bite marks in the apples they buy, which means that tested 
apples cannot be sold. Good apples yield a unit reward when sold, while the sale 
of a bad apple costs the company c > 0. 


(a) Formulate this problem as a partial monitoring problem: determine $\mathcal{L}$ and $\Phi$.

(b) What is the minimax regret in this problem? 

(c) What do you think about this problem? Will actual farmers be excited about 
your analysis? 


37.6 (TWO-ACTION PARTIAL MONITORING GAMES ARE TRIVIAL, HOPELESS OR
EASY) Let $G = (\mathcal{L}, \Phi)$ be a partial monitoring game with $k = 2$ actions. Prove
that $G$ is either trivial, hopeless or easy.


37.7 (COMPLETE LOWER BOUND FOR HARD GAMES) Complete the last step in 
the proof of Theorem 37.12. 


37.8 (LOWER BOUND FOR EASY GAMES) Prove Theorem 37.14. 
37.9 (LOWER BOUND FOR HOPELESS GAMES) Prove Theorem 37.13. 


37.10 Let $a$ and $b$ be non-duplicate Pareto optimal actions and $\lambda \in \operatorname{ri}(C_a)$. Show
there exists an $\omega \in \operatorname{ri}(C_b)$ and neighbour $c$ of $b$ such that $C_c \cap [\omega, \lambda] \ne \emptyset$.


HINT It may be useful to look at Fig. 37.6 to get some tips. The figure depicts 
a slightly different situation, but is still useful when it is changed a little. 
37.11 Generalise the proofs of Lemma 37.20 and Lemma 37.21 to handle duplicate 
and degenerate actions. 


37.12 Prove Proposition 37.18. 


HINT For Part (a), let $S \in \mathbb{R}^{km \times d}$ be obtained by stacking $(S_a)_{a=1}^k$, defined as
in the proof of Theorem 37.12. Then argue that for globally observable games, $\sqrt{d}$
times the reciprocal of the smallest non-zero singular value of $S$ is an upper bound
on $v_{\mathrm{glo}}$ and then use the fact that $S^\top S$ has integer-valued coefficients. Part (b)
follows in a similar fashion. For Part (c), use a graph-theoretic argument. 


37.13 Let $m = |\Sigma| = 2$ and $d = 2k - 1$ and construct a globally observable game
for which there exists a pair of neighbouring actions a, b for which 


min [f= C22, 
fess 


where Č > 0 is a universal constant. 
37.14 Prove Proposition 37.19. 


HINT Find choices of $p$ and $f$ that reduce the algorithm to Exp3 and exponential
weights respectively. 


37.15 (LOWER BOUND DEPENDING ON THE NUMBER OF FEEDBACKS) Consider 
G = (£, ®) given by 


1010... 1 0 

L= and 
0101: 0 1 
122 3 3 4 m—1 m-1 m 


Figure 37.7 The value of $\operatorname{opt}^*(\eta)$ as a function of $c$ in matching pennies (Example 37.3).


(a) Show this game is locally observable. 


(b) Prove that for $n \ge m$, there exists a universal constant $c > 0$ such that
$R^*_n(G) \ge c(m - 1)\sqrt{n}$.


The source for the previous exercise is the paper by the authors [Lattimore and
Szepesvári, 2019a].


37.16 (DIVERGENCE DECOMPOSITION FOR PARTIAL MONITORING) Complete the 
necessary modification of Lemma 15.1 to show that Eq. (37.8) is true. 


37.17 (ALGORITHM FOR CLASSIFYING GAMES) Write a program that accepts as 
input matrices $\mathcal{L}$ and $\Phi$ and outputs the classification of the game.


37.18 (IMPLEMENTATION (I)) Implement a solver for the optimisation problem
in Eq. (37.12). Consider the matching pennies problem (Example 37.3). Let
$\eta = 1/100$ and plot $\operatorname{opt}^*(\eta)$ as a function of the cost $c$. Explain your results.


HINT The convex optimisation problem in Eq. (37.12) seems to cause problems
for some solvers (see Note 9 for some mitigating strategies). We assume that 
many libraries can be made to work. Our implementation used the splitting cone 
solver by O’Donoghue et al. [2016, 2017]. Your plot should resemble Fig. 37.7. 


37.19 (IMPLEMENTATION (II)) In this exercise you will compare empirically or
otherwise Algorithm 26 to exponential weights and Exp3 on full information and
bandit games. Specifically:


(a) For full information games, exponential weights behaves like Algorithm 26
except that $\hat{y}_t = y_t$ and $P_t = Q_t$. Does the solution to the optimisation
problem used by Algorithm 26 lead to the same loss estimators and
distribution $P_t$?

(b) For bandits, Exp3 uses $\hat{y}_{ta} = y_{ta} \mathbb{I}\{A_t = a\} / P_{ta}$. Does Algorithm 26 end up
using the same loss estimators? Does $P_t = Q_t$?


HINT You can approach this problem by using your solution to Exercise 37.18 
and comparing values empirically. Alternatively, you can theoretically analyse 
Eq. (37.12) in these special cases. Some of these questions are answered by 
Lattimore and Szepesvari [2019d]. 
Markov Decision Processes


Bandit environments are a sensible model for many simple problems, but they do 
not model more complex environments where actions have long-term consequences. 
A brewing company needs to plan ahead when ordering ingredients, and the 
decisions made today affect their position to brew the right amount of beer in 
the future. A student learning mathematics benefits not only from the immediate 
reward of learning an interesting topic but also from their improved job prospects. 

A Markov decision process (MDP) is a simple way to incorporate long-term 
planning into the bandit framework. Like in bandits, the learner chooses actions 
and receives rewards. But they also observe a state, and the rewards for different 
actions depend on the state. Furthermore, the actions chosen affect which state 
will be observed next. 


Problem Set-Up 


An MDP is defined by a tuple $M = (\mathcal{S}, \mathcal{A}, P, r, \mu)$. The first two items $\mathcal{S}$ and $\mathcal{A}$
are sets called the state space and action space, and $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$ are
their sizes, which may be infinite. An MDP is finite if $S, A < \infty$. The quantity
$P = (P_a : a \in \mathcal{A})$ is called the transition function with $P_a : \mathcal{S} \times \mathcal{S} \to [0, 1]$
so that $P_a(s, s')$ is the probability that the learner transitions from state $s$ to
$s'$ when taking action $a$. The fourth element of the tuple is $r = (r_a : a \in \mathcal{A})$,
which is a collection of reward functions with $r_a : \mathcal{S} \to [0, 1]$. When the learner
takes action $a$ in state $s$, it receives a deterministic reward of $r_a(s)$. The last
element is $\mu \in \mathcal{P}(\mathcal{S})$, which is a distribution over the states that determines
the starting state. The transition and reward functions are often represented by
vectors or matrices. When the state space is finite, we may assume without loss
of generality that $\mathcal{S} = [S]$. We write $P_a(s) \in [0, 1]^S$ as the probability vector with
$s'$th coordinate given by $P_a(s, s')$. In the same way, we let $P_a \in [0, 1]^{S \times S}$ be the
right stochastic matrix with $(P_a)_{s,s'} = P_a(s, s')$. Finally, we view $r_a$ as a vector
in $[0, 1]^S$ in the natural way.

The interaction protocol is similar to bandits. Before the game starts, the initial
state $S_1$ is sampled from $\mu$. In each round $t$, the learner observes the state $S_t \in \mathcal{S}$,
chooses an action $A_t \in \mathcal{A}$ and receives reward $r_{A_t}(S_t)$. The environment then
samples $S_{t+1}$ from the probability vector $P_{A_t}(S_t)$, and the next round begins
(Fig. 38.1).

[Figure: flowchart. Set $t = 1$ and sample $S_1 \sim \mu$; observe state $S_t$; choose action $A_t \in \mathcal{A}$; receive reward $r_{A_t}(S_t)$; update $S_{t+1} \sim P_{A_t}(S_t)$; increment $t$ and return to the observation step.]


Figure 38.1 Interaction protocol for Markov decision processes. 
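
The protocol translates directly into a simulation loop. The sketch below assumes a finite MDP stored as nested lists, with `P[a][s]` the probability vector $P_a(s)$ and `r[a][s]` the reward $r_a(s)$. The two-state example, the uniform policy and all function names are hypothetical.

```python
import random

def sample_index(probs, rng):
    """Draw an index from a finite probability vector."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point rounding

def run_mdp(P, r, mu, policy, n, rng):
    """Simulate n rounds and return the list of (state, action, reward) triples."""
    s = sample_index(mu, rng)               # S_1 ~ mu
    trajectory = []
    for _ in range(n):
        a = policy(s, rng)                  # the learner observes S_t and chooses A_t
        trajectory.append((s, a, r[a][s]))  # reward r_{A_t}(S_t)
        s = sample_index(P[a][s], rng)      # S_{t+1} ~ P_{A_t}(S_t)
    return trajectory

# Hypothetical MDP with two states and two actions.
P = [[[0.9, 0.1], [0.5, 0.5]],   # transitions under action 0
     [[0.2, 0.8], [0.0, 1.0]]]   # transitions under action 1
r = [[0.0, 1.0], [0.1, 0.5]]
mu = [1.0, 0.0]

def uniform_policy(s, rng):
    """Ignore the state and pick one of the two actions uniformly at random."""
    return rng.randrange(2)

trajectory = run_mdp(P, r, mu, uniform_policy, n=10, rng=random.Random(1))
print(trajectory)
```

The policy here is memoryless; a history-dependent policy would need the loop to pass the full trajectory rather than only the current state.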


Although the action set is the same in all states, this does not mean that 
$P_a(s)$ or $r_a(s)$ has any relationship to $P_a(s')$ or $r_a(s')$ for states $s \ne s'$. In
this sense, it might be better to use an entirely different set of actions for 
each state, which would not change the results we present. And while we 
are at it, of course one could also allow the number of actions to vary over 
the state space. 


Histories and Policies 

Before considering the learning problem, we explain how to act in a known MDP. 
Because there is no learning going on, we call our protagonist the ‘agent’ rather 
than ‘learner’. In a stochastic bandit, the optimal policy given knowledge of the 
bandit is to choose the action with the largest expected reward in every round. 
In an MDP, the definition of optimality is less clear. 

The history $H_t = (S_1, A_1, \ldots, S_{t-1}, A_{t-1}, S_t)$ in round $t$ contains the
information available before the action for the round is to be chosen. Note
that state $S_t$ is included in $H_t$. The actions are also included because the agent
may randomise. For simplicity the rewards are omitted because the all-knowing 
agent can recompute them if needed from the state-action pairs. 

A policy is a (possibly randomised) map from the set of possible histories
to actions. Simple policies include memoryless policies, which choose actions
based on only the current state, possibly in a randomised manner. The set
of such policies is denoted by $\Pi_{\mathrm{M}}$, and its elements are identified with maps
$\pi : \mathcal{A} \times \mathcal{S} \to [0,1]$ with $\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1$ for any $s \in \mathcal{S}$ so that $\pi(a \mid s)$ is
interpreted as the probability that policy $\pi$ takes action $a$ in state $s$.
A memoryless policy that does not randomise is called a memoryless
deterministic policy. To reduce clutter, such policies are written as $\mathcal{S} \to \mathcal{A}$
maps, and the set of all such policies is denoted by $\Pi_{\mathrm{DM}}$. A policy is called a
Markov policy if its actions, which may be randomised, depend only on the round
index and the most recently observed state. These policies are represented by fixed
sequences of memoryless policies. Under a Markov policy, the sequence of states
$(S_1, S_2, \ldots)$ evolves as a Markov chain (see Section 3.2). If the Markov policy is
memoryless, this chain is homogeneous.


[Figure 38.2 diagram: a chain of six states with a trap state at one end and a high reward state at the other.]


Figure 38.2 A Markov decision process with six states and two actions represented by 
solid and dashed arrows, respectively. The numbers next to each arrow represent the 
probability of transition and reward for the action respectively. For example, taking the 
solid action in state 3 results in a reward of 0, and the probability of moving to state 
4 is 3/5, and the probability of moving to state 3 is 2/5. For human interpretability 
only, the actions are given consistent meaning across the states (blue/solid actions 
‘increment’ the state index, black/dashed actions decrement it). In reality there is no 
sense of similarity between states or actions built into the MDP formalism. 


Probability Spaces 

It will be convenient to allow infinitely long interactions between the learner and
the environment. In line with Fig. 38.1, when the agent or learner follows a policy
$\pi$ in MDP $M = (\mathcal{S}, \mathcal{A}, P, r, \mu)$, such a never-ending interaction gives rise to a
random process $(S_1, A_1, S_2, A_2, \ldots)$ so that for any $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$ and $t \geq 1$,


(a) $\mathbb{P}(S_1 = s) = \mu(s)$;
(b) $\mathbb{P}(S_{t+1} = s' \mid H_t, A_t) = P_{A_t}(S_t, s')$; and
(c) $\mathbb{P}(A_t = a \mid H_t) = \pi(a \mid H_t)$.


Meticulous readers may wonder whether there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$
holding the infinite sequence of random variables $(S_1, A_1, S_2, A_2, \ldots)$ that satisfy
(a)-(c). The Ionescu-Tulcea theorem (Theorem 3.3) furnishes us with a positive
answer (Exercise 38.1). Item (b) above is known as the Markov property. Of
course the measure $\mathbb{P}$ depends on the policy, Markov decision process and the
initial distribution. For most of the chapter, these quantities will be fixed and the
dependence is omitted from the notation. In the few places where disambiguation
is necessary, we provide additional notation. In addition to this, to minimise
clutter, we allow ourselves to write $\mathbb{P}(\cdot \mid S_1 = s)$, which just means the probability
distribution that results from the interconnection of $\pi$ and $M$, while replacing $\mu$
with an alternative initial state distribution that is a Dirac at $s$.


Traps and the Diameter of a Markov Decision Process 

A significant complication in MDPs is the potential for traps. A trap is a subset of 
the state space from which there is no escape. For example, the MDP in Fig. 38.2 
has a trap state. If being in the trap has a suboptimal yield in terms of the 
reward, the learner should avoid the trap. But since the learner can only discover 
that an action leads to a trap by trying that action, the problem of learning while 
competing with a fully informed agent is hopeless (Exercise 38.28). 

To avoid this complication, we restrict our attention to MDPs with no traps. 
An MDP is called strongly connected or communicating if for any pair of
states $s, s' \in \mathcal{S}$, there exists a policy such that when starting from $s$ there is a
positive probability of reaching $s'$ some time in the future while following the
policy. One can also define a real-valued measure of the connectedness of an MDP 
called the diameter. MDPs with smaller diameter are usually easier to learn 
because a policy can recover from mistakes more quickly. 


DEFINITION 38.1. The diameter of an MDP $M$ is

\[
D(M) = \max_{s \neq s'} \min_{\pi \in \Pi_{\mathrm{DM}}} \mathbb{E}^{\pi}\left[\min\{t \geq 1 : S_t = s\} \mid S_1 = s'\right] - 1,
\]

where the expectation is taken with respect to the law of Markov chain $(S_t)_{t=1}^{\infty}$
induced by the interaction between $\pi$ and $M$.


A number of observations are in order about this definition. First, the order
of the maximum and minimum means that for any pair of states a different
policy may be used. Second, travel times are always minimised by deterministic
memoryless policies, so the restriction to these policies in the minimum is
inessential (Exercise 38.3). Finally, the definition only considers distinct states.
We also note that when the number of states is finite, it holds that $D(M) < \infty$ if
and only if $M$ is strongly connected (Exercise 38.4). The diameter of an MDP
with $S$ states and $A$ actions cannot be smaller than $\log_A(S) - 3$ (Exercise 38.5).
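
For small finite MDPs the diameter can be approximated numerically. The sketch below is an assumption-laden illustration rather than a method from the text: it runs value iteration on the minimal expected number of steps needed to reach a target state, and the names `hitting_times` and `diameter` as well as the layout `P[a][s][s']` are hypothetical.

```python
def hitting_times(P, target, iters=10_000, tol=1e-12):
    """Approximate the minimal expected number of steps to reach `target`."""
    A, S = len(P), len(P[0])
    h = [0.0] * S
    for _ in range(iters):
        new = [0.0 if s == target else
               1.0 + min(sum(P[a][s][x] * h[x] for x in range(S)) for a in range(A))
               for s in range(S)]
        if max(abs(x - y) for x, y in zip(new, h)) < tol:
            return new
        h = new
    return h  # not converged, for example when the target is unreachable

def diameter(P):
    """Largest minimal expected travel time over ordered pairs of distinct states."""
    S = len(P[0])
    return max(hitting_times(P, s)[start]
               for s in range(S) for start in range(S) if start != s)

# Hypothetical deterministic 3-cycle: action 0 moves forward, action 1 stays put.
cycle = [[[0, 1, 0], [0, 0, 1], [1, 0, 0]],
         [[1, 0, 0], [0, 1, 0], [0, 0, 1]]]
cycle_diameter = diameter(cycle)

# Hypothetical noisy two-state MDP: the better action switches with probability 0.9.
noisy = [[[0.9, 0.1], [0.1, 0.9]],
         [[0.1, 0.9], [0.9, 0.1]]]
noisy_diameter = diameter(noisy)
```

The number of steps counted here corresponds to $\min\{t \geq 1 : S_t = s\} - 1$ in Definition 38.1, since the chain starts at round $t = 1$.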


For the remainder of this chapter, unless otherwise specified, all MDPs are
assumed to be strongly connected.


Optimal Policies and the Bellman Optimality Equation


We now define the notion of an optimal policy and outline the proof that there 
exists a deterministic memoryless optimal policy. Along the way, we define what is 
called the Bellman optimality equation. Methods that solve this equation are the 
basis for finding optimal policies in an efficient manner and also play a significant 
role in learning algorithms. Throughout, we fix a strongly connected MDP M. 

The gain of a policy $\pi$ is the long-term average reward expected from using
that policy when starting in state $s$:

\[
\rho_s^{\pi} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}^{\pi}\left[r_{A_t}(S_t) \mid S_1 = s\right],
\]

where $\mathbb{E}^{\pi}$ denotes the expectation on the interaction sequence when policy $\pi$
interacts with MDP $M$. In general, the limit need not exist, so we also introduce

\[
\bar{\rho}_s^{\pi} = \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}^{\pi}\left[r_{A_t}(S_t) \mid S_1 = s\right],
\]

which exists for any policy. Of course, whenever $\rho_s^{\pi}$ exists we have $\bar{\rho}_s^{\pi} = \rho_s^{\pi}$. The
optimal gain is a real value

\[
\rho^* = \max_{s \in \mathcal{S}} \sup_{\pi} \bar{\rho}_s^{\pi},
\]

where the supremum is taken over all policies. A policy $\pi$ is an optimal policy
if $\bar{\rho}^{\pi} = \rho^* \mathbf{1}$. For strongly connected MDPs, an optimal policy is guaranteed to
exist. This is far from trivial, however, and we will spend the next little while
outlining the proof.


MDPs that are not strongly connected may not have a constant optimal 
gain. This makes everything more complicated, and we are lucky not to have 
to deal with such MDPs here. 


Before continuing, we need some new notation. For a memoryless policy $\pi$, define

\[
P_{\pi}(s, s') = \sum_{a \in \mathcal{A}} \pi(a \mid s) P_a(s, s') \quad \text{and} \quad r_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) r_a(s). \tag{38.1}
\]

We view $P_{\pi}$ as an $S \times S$ transition matrix and $r_{\pi}$ as a vector in $\mathbb{R}^S$. With this
notation, $P_{\pi}$ is the transition matrix of the homogeneous Markov chain $S_1, S_2, \ldots$
when $A_t \sim \pi(\cdot \mid S_t)$. The gain of a memoryless policy $\pi$ satisfies

\[
\rho^{\pi} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} P_{\pi}^{t-1} r_{\pi} = P_{\pi}^* r_{\pi}, \tag{38.2}
\]

where $P_{\pi}^* = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} P_{\pi}^{t-1}$ is called the stationary transition matrix,
the existence of which you will prove in Exercise 38.7. For each $k \in \mathbb{N}$, define

\[
v_{\pi}^{(k)} = \sum_{t=1}^{k} P_{\pi}^{t-1} (r_{\pi} - \rho^{\pi}).
\]

For $s \in \mathcal{S}$, $v_{\pi}^{(k)}(s)$ gives the total expected excess reward collected by $\pi$ when
the process starts at state $s$ and lasts for $k$ time steps. The (differential) value
function of a policy is a function $v_{\pi} : \mathcal{S} \to \mathbb{R}$ defined as the Cesàro sum of the
sequence $(P_{\pi}^{t} (r_{\pi} - \rho^{\pi}))_{t \geq 0}$,

\[
v_{\pi} = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} v_{\pi}^{(k)} = \left((I - P_{\pi} + P_{\pi}^*)^{-1} - P_{\pi}^*\right) r_{\pi}. \tag{38.3}
\]

Note, the second equality above is non-trivial (Exercise 38.7). The definition
implies that $v_{\pi}(s) - v_{\pi}(s')$ is the 'average' long-term advantage of starting in
state $s$ relative to starting in state $s'$ when following policy $\pi$. These quantities
are only defined for memoryless policies where they are also guaranteed to exist
(Exercise 38.7). The definition of $P_{\pi}^*$ implies that $P_{\pi}^* P_{\pi} = P_{\pi}^*$, which in turn
implies that $P_{\pi}^* v_{\pi} = 0$. Combining this with Eqs. (38.2) and (38.3) shows that
for any memoryless policy $\pi$,

\[
\rho^{\pi} + v_{\pi} = r_{\pi} + P_{\pi} v_{\pi}. \tag{38.4}
\]

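
As a numerical sanity check of Eqs. (38.1) to (38.4), the following sketch builds $P_\pi$ and $r_\pi$ for a small example and verifies Eq. (38.4). It assumes the induced chain is aperiodic, so that plain limits exist and Cesàro averaging is unnecessary; the MDP, the policy table `pi[s][a]` and all function names are hypothetical.

```python
def matvec(M, x):
    """Matrix-vector product for lists of lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def policy_model(P, r, pi):
    """P_pi and r_pi from Eq. (38.1); pi[s][a] is the probability of a in s."""
    A, S = len(P), len(P[0])
    P_pi = [[sum(pi[s][a] * P[a][s][x] for a in range(A)) for x in range(S)]
            for s in range(S)]
    r_pi = [sum(pi[s][a] * r[a][s] for a in range(A)) for s in range(S)]
    return P_pi, r_pi

def gain_and_value(P_pi, r_pi, n=200):
    """Approximate rho^pi and v_pi, assuming the chain is aperiodic."""
    rho = r_pi[:]
    for _ in range(n):
        rho = matvec(P_pi, rho)              # P^n r approaches P* r
    d = [a - b for a, b in zip(r_pi, rho)]   # excess reward r - rho
    v = [0.0] * len(r_pi)
    for _ in range(n):
        v = [a + b for a, b in zip(v, d)]    # accumulate P^t (r - rho)
        d = matvec(P_pi, d)
    return rho, v

# Hypothetical two-state MDP and randomised memoryless policy.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.1, 0.9], [0.9, 0.1]]]
r = [[0.0, 1.0], [0.0, 1.0]]
pi = [[0.5, 0.5], [1.0, 0.0]]

P_pi, r_pi = policy_model(P, r, pi)
rho, v = gain_and_value(P_pi, r_pi)
lhs = [a + b for a, b in zip(rho, v)]                 # rho + v
rhs = [a + b for a, b in zip(r_pi, matvec(P_pi, v))]  # r + P v
residual = max(abs(a - b) for a, b in zip(lhs, rhs))
```

For periodic chains the powers of $P_\pi$ do not converge and the averaging in Eqs. (38.2) and (38.3) would have to be implemented explicitly.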
A value function is a function $v : \mathcal{S} \to \mathbb{R}$, and its span is given by

\[
\operatorname{span}(v) = \max_{s \in \mathcal{S}} v(s) - \min_{s \in \mathcal{S}} v(s).
\]

As with other quantities, value functions are associated with vectors in $\mathbb{R}^S$. A
greedy policy with respect to value function $v$ is a deterministic memoryless
policy $\pi_v$ given by

\[
\pi_v(s) = \operatorname{argmax}_{a \in \mathcal{A}} \; r_a(s) + \langle P_a(s), v \rangle.
\]

There may be many policies that are greedy with respect to some value function $v$
due to ties in the maximum. Usually the ties do not matter, but for consistency
and for the sake of simplifying matters, we assume that ties are broken in a
systematic fashion. In particular, this makes $\pi_v$ well defined for any value function.
One way to find the optimal policy is as the greedy policy with respect to a
value function that satisfies the Bellman optimality equation, which is

\[
\rho + v(s) = \max_{a \in \mathcal{A}} \left(r_a(s) + \langle P_a(s), v \rangle\right) \quad \text{for all } s \in \mathcal{S}. \tag{38.5}
\]

This is a system of $S$ nonlinear equations with unknowns $\rho \in \mathbb{R}$ and $v \in \mathbb{R}^S$.
The reader will notice that if $v : \mathcal{S} \to \mathbb{R}$ is a solution to Eq. (38.5), then so is
$v + c\mathbf{1}$ for any constant $c \in \mathbb{R}$, and hence the Bellman optimality equation lacks
unique solutions. It is not true that the optimal value function is unique up to
translation, even when $M$ is strongly connected (Exercise 38.11). The $v$-part of
a solution pair $(\rho, v)$ of Eq. (38.5) is called an optimal (differential) value
function.


THEOREM 38.2. The following hold:

(a) There exists a pair $(\rho, v)$ that satisfies the Bellman optimality equation.
(b) If $(\rho, v)$ satisfies the Bellman optimality equation, then $\rho = \rho^*$ and $\pi_v$ is
optimal.
(c) There exists a deterministic memoryless optimal policy.


Proof sketch The proof of part (a) is too long to include here, but we guide you
through it in Exercise 38.10. For part (b), let $(\rho, v)$ satisfy the Bellman equation
and $\pi^* = \pi_v$ be the greedy policy with respect to $v$. Then, by Eq. (38.2),

\[
\rho^{\pi^*} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} P_{\pi^*}^{t-1} r_{\pi^*}
= \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} P_{\pi^*}^{t-1} \left(\rho \mathbf{1} + v - P_{\pi^*} v\right) = \rho \mathbf{1}.
\]

Next, let $\pi$ be an arbitrary Markov policy. We show that $\bar{\rho}^{\pi} \leq \rho \mathbf{1}$. The result is
then completed using the result of Exercise 38.2, where you will prove that for any
policy $\pi$, there exists a Markov policy with the same expected rewards. Denote
by $\pi_t$ the memoryless policy used at time $t = 1, 2, \ldots$ when following the Markov
policy $\pi$, and for $t \geq 1$, let $P_{\pi}^{(t)} = P_{\pi_1} \cdots P_{\pi_t}$, while for $t = 0$, let $P_{\pi}^{(0)} = I$. Thus,
$P_{\pi}^{(t)}(s, s')$ is the probability of ending up in state $s'$ while following $\pi$ from state
$s$ for $t$ time steps. It follows that $\bar{\rho}^{\pi} = \limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} P_{\pi}^{(t-1)} r_{\pi_t}$. Fix $t \geq 1$.
Using the fact that $\pi^*$ is the greedy policy with respect to $v$ gives

\[
\begin{aligned}
P_{\pi}^{(t-1)} r_{\pi_t} &= P_{\pi}^{(t-1)} \left(r_{\pi_t} + P_{\pi_t} v - P_{\pi_t} v\right) \\
&\leq P_{\pi}^{(t-1)} \left(r_{\pi^*} + P_{\pi^*} v - P_{\pi_t} v\right) \\
&= P_{\pi}^{(t-1)} \left(\rho \mathbf{1} + v - P_{\pi_t} v\right) \\
&= \rho \mathbf{1} + P_{\pi}^{(t-1)} v - P_{\pi}^{(t)} v.
\end{aligned}
\]

Taking the average of both sides over $t \in [n]$ and then taking the limit shows
that $\bar{\rho}^{\pi} \leq \rho \mathbf{1}$, finishing the proof. Part (c) follows immediately from the first
two parts.

The theorem shows that there exist solutions to the Bellman optimality equation 
and that the greedy policy with respect to the resulting value function is an 
optimal policy. We need one more result about solutions to the Bellman optimality 
equation, the proof of which you will provide in Exercise 38.13. 


LEMMA 38.3. Suppose that $(\rho, v)$ satisfies the Bellman optimality equation. Then
$\operatorname{span}(v) \leq D(M)$.


The map $T : \mathbb{R}^S \to \mathbb{R}^S$ defined by $(Tv)(s) = \max_{a \in \mathcal{A}} r_a(s) + \langle P_a(s), v \rangle$ is
called the Bellman operator. The Bellman optimality equation can be
written as $\rho \mathbf{1} + v = Tv$.


Finding an Optimal Policy


There are many ways to find an optimal policy, including value iteration, policy
iteration and enumeration. These ideas are briefly discussed in Note 12. Here
we describe a two-step approach based on linear programming. Consider the
following constrained linear optimisation problem:

\[
\begin{aligned}
\underset{\rho \in \mathbb{R},\, v \in \mathbb{R}^S}{\text{minimise}} \quad & \rho \\
\text{subject to} \quad & \rho + v(s) \geq r_a(s) + \langle P_a(s), v \rangle \quad \text{for all } s, a.
\end{aligned} \tag{38.6}
\]

Recall that a constrained optimisation problem is said to be feasible if the set
of values that satisfy the constraints is non-empty.


THEOREM 38.4. The optimisation problem in Eq. (38.6) is feasible, and if $(\rho, v)$
is a solution, then $\rho = \rho^*$ is the optimal gain.


Solutions $(\rho, v)$ to the optimisation problem in Eq. (38.6) need not satisfy
the Bellman optimality equation (Exercise 38.12).


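Proof of Theorem 38.4 Theorem 38.2 guarantees the existence of a pair $(\rho^*, v^*)$
that satisfies the Bellman optimality equation:

\[
\rho^* + v^*(s) = \max_{a \in \mathcal{A}} r_a(s) + \langle P_a(s), v^* \rangle \quad \text{for all } s \in \mathcal{S}.
\]

Hence the pair $(\rho^*, v^*)$ satisfies the constraints in Eq. (38.6) and witnesses
feasibility. Next, let $(\rho, v)$ be a solution of Eq. (38.6). Since $(\rho^*, v^*)$ satisfies the
constraints, $\rho \leq \rho^*$ is immediate. It remains to prove that $\rho \geq \rho^*$. Let $\pi = \pi_v$
be the greedy policy with respect to $v$ and $\pi^*$ be greedy with respect to $v^*$. By
Theorem 38.2, $\rho^* = \rho^{\pi^*}$. Furthermore,

\[
P_{\pi^*}^{t} r_{\pi^*} \leq P_{\pi^*}^{t} \left(r_{\pi} + P_{\pi} v - P_{\pi^*} v\right) \leq P_{\pi^*}^{t} \left(\rho \mathbf{1} + v - P_{\pi^*} v\right) = \rho \mathbf{1} + P_{\pi^*}^{t} v - P_{\pi^*}^{t+1} v.
\]

The first inequality holds because $\pi$ is greedy with respect to $v$, and the second
follows from the constraints in Eq. (38.6). Summing over $t$ shows that
$\rho^* \mathbf{1} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} P_{\pi^*}^{t} r_{\pi^*} \leq \rho \mathbf{1}$, which completes
the proof.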


Having found the optimal gain, the next step is to find a value function that
satisfies the Bellman optimality equation. Let $\tilde{s} \in \mathcal{S}$, and consider the following
linear program:

\[
\begin{aligned}
\underset{v \in \mathbb{R}^S}{\text{minimise}} \quad & \langle v, \mathbf{1} \rangle \\
\text{subject to} \quad & \rho^* + v(s) \geq r_a(s) + \langle P_a(s), v \rangle \quad \text{for all } s, a \\
& v(\tilde{s}) = 0.
\end{aligned} \tag{38.7}
\]

The second constraint is crucial in order for the minimum to exist, since otherwise
the value function can be arbitrarily small.


THEOREM 38.5. There exists a state $\tilde{s} \in \mathcal{S}$ such that the solution $v$ of Eq. (38.7)
satisfies the Bellman optimality equation.


Proof The result follows by showing that $\varepsilon = v + \rho^* \mathbf{1} - Tv = 0$. The first
constraint in Eq. (38.7) ensures that $\varepsilon \geq 0$. It remains to show that $\varepsilon \leq 0$. Let
$\pi^*$ be an optimal policy and $\pi$ be the greedy policy with respect to $v$. Then

\[
P_{\pi^*}^{t} r_{\pi^*} \leq P_{\pi^*}^{t} \left(r_{\pi} + P_{\pi} v - P_{\pi^*} v\right) = P_{\pi^*}^{t} \left(\rho^* \mathbf{1} + v - \varepsilon - P_{\pi^*} v\right).
\]

Hence $\rho^* \mathbf{1} = \rho^{\pi^*} \mathbf{1} \leq \rho^* \mathbf{1} - P_{\pi^*}^* \varepsilon$ and $P_{\pi^*}^* \varepsilon \leq 0$. Since $\varepsilon \geq 0$ and $P_{\pi^*}^*$ is right
stochastic, $P_{\pi^*}^* \varepsilon = 0$. Choose $\tilde{s}$ to be a state such that $P_{\pi^*}^*(s, \tilde{s}) > 0$ for some $s \in \mathcal{S}$,
which exists because $P_{\pi^*}^*$ is right stochastic. Then $0 = (P_{\pi^*}^* \varepsilon)(s) \geq P_{\pi^*}^*(s, \tilde{s}) \varepsilon(\tilde{s})$
and hence $\varepsilon(\tilde{s}) = 0$. It follows that $\tilde{v} = v - \varepsilon$ also satisfies the constraints in
Eq. (38.7). Because $v$ is a solution to Eq. (38.7), $\langle \tilde{v}, \mathbf{1} \rangle \geq \langle v, \mathbf{1} \rangle$, implying that
$\langle \varepsilon, \mathbf{1} \rangle \leq 0$. Since we already showed that $\varepsilon \geq 0$, it follows that $\varepsilon = 0$.


The theorem only demonstrates the existence of a state $\tilde{s}$ for which the solution
of Eq. (38.7) satisfies the Bellman optimality equation. There is a relatively
simple procedure for finding such a state using the solution to Eq. (38.6), but its
analysis depends on the basic theory of duality from linear programming, which
is beyond the scope of this text. More details are in Note 11 at the end of the
chapter. Instead we observe that one can simply solve Eq. (38.7) for all choices
of $\tilde{s}$ and take the first solution that satisfies the Bellman optimality equation.


Efficient Computation 


The linear programs in Eq. (38.6) and Eq. (38.7) can be solved efficiently under 
assumptions that will be satisfied in subsequent applications. 


The algorithm proposed in this subsection is guaranteed to run in polynomial 
time, which is a standard objective in theoretical computer science. Its 
practical performance, however, is usually much worse than alternatives that 
suffer from exponential running time in the worst case. These issues are 
discussed in Note 12 at the end of the chapter. 


The general form of a linear program is an optimisation problem of the form

\[
\begin{aligned}
\underset{x \in \mathbb{R}^n}{\text{minimise}} \quad & \langle c, x \rangle \\
\text{subject to} \quad & Ax \geq b,
\end{aligned}
\]

where $c \in \mathbb{R}^n$ and $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are parameters of the problem. This
general problem can be solved in time that depends polynomially on $n$ and $m$.
When $m$ is very large or infinite, these algorithms may become impractical, but
nevertheless one can often still solve the optimisation problem in time polynomial
in $n$ only, provided that the constraints satisfy certain structural properties. Let
$K \subset \mathbb{R}^n$ be convex, and consider the optimisation problem

\[
\begin{aligned}
\underset{x \in \mathbb{R}^n}{\text{minimise}} \quad & \langle c, x \rangle \\
\text{subject to} \quad & x \in K.
\end{aligned} \tag{38.8}
\]

Algorithms for this problem generally have a slightly different flavour because $K$
may have no corners. Suppose the following holds:


(a) There exists a known $R > 0$ such that $K \subset \{x \in \mathbb{R}^n : \|x\|_2 \leq R\}$.

(b) There exists a separation oracle, which we recall from Chapter 27, is a
computational procedure to evaluate some function $\phi$ on $\mathbb{R}^n$ with $\phi(x) =$
TRUE for $x \in K$, and otherwise $\phi(x) = u$ with $\langle y, u \rangle > \langle x, u \rangle$ for all $y \in K$
(see Fig. 27.1).

(c) There exists a $\delta > 0$ and $x_0 \in \mathbb{R}^n$ such that $\{x \in \mathbb{R}^n : \|x - x_0\|_2 \leq \delta\} \subset K$.


Under these circumstances, the ellipsoid method accepts as input the size of the
bounding sphere $R$, the separation oracle and an accuracy parameter $\varepsilon > 0$. Its
output is a point $x$ in time polynomial in $n$ and $\log(R/(\delta\varepsilon))$ such that $x \in K$ and
$\langle c, x \rangle \leq \langle c, x^* \rangle + \varepsilon$, where $x^*$ is the minimiser of Eq. (38.8). The reader can find
references to this method at the end of the chapter.

The linear programs in Eq. (38.6) and Eq. (38.7) do not have bounded feasible
regions because if $v$ is feasible, then $v + c\mathbf{1}$ is also feasible for any $c \in \mathbb{R}$. For
strongly connected MDPs with diameter $D$, however, Lemma 38.3 allows us to
add the constraint that $\|v\|_\infty \leq D$. If the rewards are bounded in $[0, 1]$, then we
may also add the constraint that $0 \leq \rho \leq 1$. Together these imply that for $(v, \rho)$
in the feasible region,

\[
\|(\rho, v)\|_2^2 = \rho^2 + \|v\|_2^2 \leq 1 + S \|v\|_\infty^2 \leq 1 + S D^2.
\]

Then set $R = \sqrt{1 + D^2 S}$. When the diameter is unknown, one may use a doubling
procedure. In order to guarantee the feasible region contains a small ball, we 
add some slack to the constraints. Let $\varepsilon > 0$, and consider the following linear
program:

\[
\begin{aligned}
\underset{\rho \in \mathbb{R},\, v \in \mathbb{R}^S}{\text{minimise}} \quad & \rho \\
\text{subject to} \quad & \varepsilon + \rho + v(s) \geq r_a(s) + \langle P_a(s), v \rangle \quad \text{for all } s, a \\
& v(s) \geq -D \quad \text{for all } s \\
& v(s) \leq D \quad \text{for all } s \\
& \rho \leq 1 + \varepsilon \\
& \rho \geq -\varepsilon.
\end{aligned} \tag{38.9}
\]

Note that for any $x$ in the feasible region of Eq. (38.9), there exists a $y$ that is
feasible for Eq. (38.6) with $\|x - y\|_\infty \leq \varepsilon$. Furthermore, the solution to the above
linear program is at most $\varepsilon$ away from the solution to Eq. (38.6). What we have
bought by adding this slack is that now the linear program in Eq. (38.9) satisfies
the conditions (a) and (c) above. The final step is to give a condition when a
separation oracle exists for the convex set determined by the constraints in the
above program. Define convex set $K$ by

\[
K = \left\{(\rho, v) \in \mathbb{R}^{S+1} : \varepsilon + \rho + v(s) \geq r_a(s) + \langle P_a(s), v \rangle \text{ for all } s, a\right\}. \tag{38.10}
\]

Assuming that

\[
\operatorname{argmax}_{a \in \mathcal{A}} \left(r_a(s) + \langle P_a(s), v \rangle\right) \tag{38.11}
\]

can be solved efficiently, Algorithm 27 provides a separation oracle for $K$. For the
specialised case considered later, Eq. (38.11) is trivial to compute efficiently. The
feasible region defined by the constraints in Eq. (38.9) is the intersection of $K$ with
a small number of half-spaces. In Exercise 38.15, you will show how to efficiently
extend a separation oracle for arbitrary convex set $K$ to $\bigcap_{k=1}^{n} H_k \cap K$, where
$(H_k)_{k=1}^{n}$ are half-spaces. You will show in Exercise 38.14 that approximately
solving Eq. (38.7) works in the same way as above, as well as the correctness of
Algorithm 27.


In Theorem 38.2, we assumed an exact solution of the Bellman optimality 
equation, which may not be possible in practice. Fortunately, approximate 
solutions to the Bellman optimality equation with approximately greedy 
policies yield approximately optimal policies. Details are deferred to 
Exercise 38.16. 


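function SEPARATIONORACLE($\rho$, $v$)
    For each $s \in \mathcal{S}$ find $a_s^* \in \operatorname{argmax}_a \left(r_a(s) + \langle P_a(s), v \rangle\right)$
    if $\varepsilon + \rho + v(s) \geq r_{a_s^*}(s) + \langle P_{a_s^*}(s), v \rangle$ for all $s \in \mathcal{S}$ then
        return TRUE
    else
        Find state $s$ with $\varepsilon + \rho + v(s) < r_{a_s^*}(s) + \langle P_{a_s^*}(s), v \rangle$
        return $(1, e_s - P_{a_s^*}(s))$
    end if
end function


Algorithm 27: Separation oracle for Eq. (38.6).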
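
A direct transcription of Algorithm 27 into Python might look as follows. This is a sketch under assumptions: the array layout `P[a][s][s']`, `r[a][s]`, the helper `inner` and the example MDP are hypothetical, and the returned vector is ordered as $(\rho, v(1), \ldots, v(S))$.

```python
def separation_oracle(P, r, rho, v, eps=0.0):
    """Return True if (rho, v) lies in the set K of Eq. (38.10); otherwise
    return u = (1, e_s - P_a(s)) for a violated state-action pair (s, a)."""
    A, S = len(P), len(P[0])
    for s in range(S):
        q = [r[a][s] + sum(P[a][s][x] * v[x] for x in range(S)) for a in range(A)]
        a_star = max(range(A), key=lambda a: q[a])
        if eps + rho + v[s] < q[a_star]:          # constraint violated at (s, a_star)
            e_s = [1.0 if x == s else 0.0 for x in range(S)]
            return [1.0] + [e_s[x] - P[a_star][s][x] for x in range(S)]
    return True

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

# Hypothetical two-state MDP; (0.9, [0, 1]) solves its Bellman optimality equation.
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.1, 0.9], [0.9, 0.1]]]
r = [[0.0, 1.0], [0.0, 1.0]]

feasible = separation_oracle(P, r, 0.9, [0.0, 1.0], eps=1e-9)
u = separation_oracle(P, r, 0.5, [0.0, 1.0], eps=1e-9)
x_bad = [0.5, 0.0, 1.0]        # the rejected point (rho, v)
y_good = [0.9, 0.0, 1.0]       # a point inside K
```

The violated constraint can be rewritten as $\langle (\rho, v), u \rangle < r_a(s) - \varepsilon$, while every point of $K$ satisfies the reverse inequality, which is why $u$ separates the rejected point from $K$.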


Learning in Markov Decision Processes 


The problem of finding an optimal policy in an unknown MDP is no longer just 
an optimisation problem, and the notion of regret is introduced to measure the 
price of the uncertainty. For simplicity we assume that only the transition matrix
is unknown while the reward function is given. This assumption is not especially
restrictive as the case where the rewards are also unknown is easily covered using
either a reduction or a simple generalisation, as we explain in the notes. The
regret of a policy $\pi$ is the deficit of rewards suffered relative to the expected
average reward of an optimal policy:

\[
\hat{R}_n = n\rho^* - \sum_{t=1}^{n} r_{A_t}(S_t).
\]

The reader will notice we are comparing the non-random $n\rho^*$ to the random
sum of rewards received by the learner, which was also true in the study of
stochastic bandits. The difference is that $\rho^*$ is an asymptotic quantity while for
stochastic bandits the analogous quantity was $n\mu^*$. The definition still makes
sense, however, because for MDPs with finite diameter $D$ the optimal expected
cumulative reward over $n$ rounds is at least $n\rho^* - D$ so the difference is negligible
(Exercise 38.17). The main result of this chapter is the following:


THEOREM 38.6. Let $S$, $A$ and $n$ be natural numbers and $\delta \in (0,1)$. There
exists an efficiently computable policy $\pi$ that when interacting with any MDP
$M = (\mathcal{S}, \mathcal{A}, P, r)$ with $S$ states, $A$ actions, rewards in $[0,1]$ and any initial state
distribution satisfies with probability at least $1 - \delta$,

\[
\hat{R}_n \leq C D(M) S \sqrt{A n \log(nSA/\delta)},
\]

where $C$ is a universal constant.


In Exercise 38.18, we ask you to use the assumption that the rewards are
bounded to find a choice of $\delta \in (0,1)$ such that

\[
\mathbb{E}[\hat{R}_n] \leq 1 + C D(M) S \sqrt{2 A n \log(n)}. \tag{38.12}
\]


This result is complemented by the following lower bound: 


THEOREM 38.7. Let $S \geq 3$, $A \geq 2$, $D \geq 6 + 2\log_A S$ and $n \geq DSA$. Then for
any policy $\pi$ there exists a Markov decision process with $S$ states, $A$ actions and
diameter at most $D$ such that

\[
\mathbb{E}[\hat{R}_n] \geq C \sqrt{DSAn},
\]

where $C > 0$ is again a universal constant.


The upper and lower bounds are separated by a factor of at least $\sqrt{DS}$, which
is a considerable gap. Recent work has made progress towards closing this gap as
we explain in the notes.
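
The size of the gap is easy to see numerically. The snippet below evaluates the two bounds with both universal constants set to 1, which is purely an assumption for illustration since the theorems do not specify them; the function names and the parameter values are likewise hypothetical.

```python
import math

def upper_bound(D, S, A, n, delta, C=1.0):
    """Right-hand side of Theorem 38.6 with an assumed constant C."""
    return C * D * S * math.sqrt(A * n * math.log(n * S * A / delta))

def lower_bound(D, S, A, n, C=1.0):
    """Right-hand side of Theorem 38.7 with an assumed constant C."""
    return C * math.sqrt(D * S * A * n)

D, S, A, n, delta = 10, 5, 2, 10_000, 0.05
ratio = upper_bound(D, S, A, n, delta) / lower_bound(D, S, A, n)
gap = math.sqrt(D * S)   # the sqrt(DS) factor mentioned in the text
```

With equal constants the ratio is $\sqrt{DS \log(nSA/\delta)}$, so it exceeds $\sqrt{DS}$ whenever the logarithm is at least 1.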


Upper Confidence Bounds for Reinforcement Learning 


Reinforcement learning is the subfield of machine learning devoted to
designing and studying algorithms that learn to maximise long-term reward
in a sequential context. The algorithm that establishes Theorem 38.6 is called
UCRL2 because it is the second version of the 'upper confidence bounds for
reinforcement learning' algorithm. Its pseudocode is shown in Algorithm 28.

At the start of each phase, UCRL2 computes an optimal policy for the
statistically plausible MDP with the largest optimal gain. The details of this
computation are left to the next section. This policy is then implemented until
the number of visits to some state-action pair doubles, at which point a new phase
starts and the process begins again. The use of phases is important, not just for
computational efficiency. Recalculating the optimistic policy in each round may
lead to a dithering behaviour in which the algorithm frequently changes its plan
and suffers linear regret (Exercise 38.19).

To complete the specification of the algorithm, we must define confidence
sets on the unknown quantity, which in this case is the transition matrix. The
confidence sets are centered at the empirical transition probabilities defined by

\[
\hat{P}_{t,a}(s, s') = \frac{\sum_{u=1}^{t} \mathbb{I}\{S_u = s, A_u = a, S_{u+1} = s'\}}{1 \vee T_t(s, a)},
\]

where $T_t(s, a) = \sum_{u=1}^{t} \mathbb{I}\{S_u = s, A_u = a\}$ is the number of times action $a$ was
taken in state $s$. As before, we let $\hat{P}_{t,a}(s)$ be the vector whose $s'$th entry is
$\hat{P}_{t,a}(s, s')$. Given a state-action pair $s, a$, define

\[
C_t(s, a) = \left\{ P \in \mathcal{P}(\mathcal{S}) : \|P - \hat{P}_{t-1,a}(s)\|_1 \leq \sqrt{\frac{S L_{t-1}(s, a)}{1 \vee T_{t-1}(s, a)}} \right\}, \tag{38.13}
\]


where for $T_t(s, a) > 0$ we set

\[
L_t(s, a) = 2 \log\left(\frac{4 S A \, T_t(s, a) (1 + T_t(s, a))}{\delta}\right)
\]


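and for $T_t(s, a) = 0$ we set $L_t(s, a) = 1$. Note that in this case $C_{t+1}(s, a) = \mathcal{P}(\mathcal{S})$.
Then define

\[
C_t = \left\{P = (P_a(s))_{s,a} : P_a(s) \in C_t(s, a) \text{ for all } s, a \in \mathcal{S} \times \mathcal{A}\right\}. \tag{38.14}
\]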


Clearly $T_t(s, a)$ cannot be larger than the total number of rounds $n$, so

\[
L_t(s, a) \leq L = 2 \log\left(\frac{4 S A n (n + 1)}{\delta}\right). \tag{38.15}
\]

The algorithm operates in phases $k = 1, 2, 3, \ldots$ with the first phase starting in
round $\tau_1 = 1$ and the $(k + 1)$th phase starting in round $\tau_{k+1}$ defined inductively
by

\[
\tau_{k+1} = 1 + \min\left\{t \geq \tau_k : T_t(S_t, A_t) \geq 2 T_{\tau_k - 1}(S_t, A_t)\right\},
\]

which means that the next phase starts once the number of visits to some
state-action pair at least doubles.


Input $S$, $A$, $r$, $\delta \in (0,1)$
$t = 0$
for $k = 1, 2, \ldots$ do
    $\tau_k = t + 1$
    Find $\pi_k$ as the greedy policy with respect to $v_k$ satisfying Eq. (38.16)
    do
        $t \leftarrow t + 1$, observe $S_t$ and take action $A_t = \pi_k(S_t)$
    while $T_t(S_t, A_t) < 2 T_{\tau_k - 1}(S_t, A_t)$
end for


Algorithm 28: UCRL2.
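
The bookkeeping behind Algorithm 28 (visit counts, empirical transition estimates, the radius of the $\ell_1$ confidence set and the doubling test) can be sketched as below. The class and function names are hypothetical, and the logarithmic factor is passed in by the caller as `L` rather than computed, so the sketch makes no claim about its exact form. The planning step of Eq. (38.16) is deliberately left out.

```python
import math
from collections import defaultdict

class Counts:
    """Visit counts and empirical transition estimates for S states."""

    def __init__(self, S):
        self.S = S
        self.visits = defaultdict(int)        # T_t(s, a)
        self.transitions = defaultdict(int)   # observed (s, a, s') triples

    def update(self, s, a, s_next):
        self.visits[s, a] += 1
        self.transitions[s, a, s_next] += 1

    def estimate(self, s, a):
        """Empirical transition vector; all zeros if (s, a) is unvisited."""
        T = max(1, self.visits[s, a])
        return [self.transitions[s, a, x] / T for x in range(self.S)]

    def radius(self, s, a, L):
        """Radius sqrt(S L / (1 v T)) of the l1-ball in Eq. (38.13)."""
        return math.sqrt(self.S * L / max(1, self.visits[s, a]))

def phase_is_over(counts, visits_at_phase_start, s, a):
    """Doubling test from the do-while loop of Algorithm 28."""
    return counts.visits[s, a] >= 2 * visits_at_phase_start.get((s, a), 0)

counts = Counts(S=3)
snapshot = dict(counts.visits)                # counts at the start of the phase
for s_next in (2, 2, 0):
    counts.update(0, 1, s_next)
p_hat = counts.estimate(0, 1)
unvisited_radius = counts.radius(0, 0, L=1.0) # L = 1 for unvisited pairs
```

Taking a snapshot of the counts at the start of each phase is one simple way to evaluate $T_{\tau_k - 1}(s, a)$ during the phase.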


The Extended Markov Decision Process 


The confidence set $C_t$ defines a set of plausible transition probability functions at
the start of round $t$. Since the reward function is known already, this corresponds
to a set of plausible MDPs. The algorithm plays according to the optimal policy
in the plausible MDP with the largest gain. There is some subtlety because
the optimal policy is not unique, and what is really needed is to find a policy
that is greedy with respect to a value function satisfying the Bellman optimality
equation in the plausible MDP with the largest gain. Precisely, at the start of
the $k$th phase, the algorithm must find a value function $v_k$, gain $\rho_k$ and MDP
$M_k = (\mathcal{S}, \mathcal{A}, P_k, r)$ with $P_k \in C_{\tau_k}$ such that

\[
\begin{aligned}
\rho_k + v_k(s) &= \max_{a \in \mathcal{A}} r_a(s) + \langle P_{k,a}(s), v_k \rangle \quad \text{for all } s \in \mathcal{S}, \\
\rho_k &= \max_{s \in \mathcal{S}} \max_{\pi \in \Pi_{\mathrm{DM}}} \max_{P \in C_{\tau_k}} \rho_s^{\pi}(P),
\end{aligned} \tag{38.16}
\]

where $\rho_s^{\pi}(P)$ is the gain of deterministic memoryless policy $\pi$ starting in state $s$
in the MDP with transition probability function $P$. The algorithm then plays
according to $\pi_k$ defined as the greedy policy with respect to $v_k$. There is quite a
lot hidden in these equations. The gain is only guaranteed to be constant when
$M_k$ has a finite diameter, but this may not hold for all plausible MDPs. As it
happens, however, solutions to Eq. (38.16) are guaranteed to exist and can be
found efficiently. To see why this is true we introduce the extended MDP $\tilde{M}_k$,
which has state space $\mathcal{S}$ and state-dependent action space $\tilde{\mathcal{A}}_s$ given by

\[
\tilde{\mathcal{A}}_s = \left\{(a, P) : a \in \mathcal{A}, P \in C_{\tau_k}(s, a)\right\}.
\]

The reward function of the extended MDP is $\tilde{r}_{(a,P)}(s) = r_a(s)$, and the transitions
are $\tilde{P}_{(a,P)}(s) = P_a(s)$. The action space in the extended MDP allows the agent to
choose both $a \in \mathcal{A}$ and a plausible transition vector $P_a(s) \in C_{\tau_k}(s, a)$. By the
definition of the confidence sets, for any pair of states $s, s'$ and action $a \in \mathcal{A}$,
there always exists a transition vector $P_a(s) \in C_{\tau_k}(s, a)$ such that $P_a(s, s') > 0$,
which means that $\tilde{M}_k$ is strongly connected. Hence solving the Bellman optimality
equation for $\tilde{M}_k$ yields a value function $v_k$ and constant gain $\rho_k \in \mathbb{R}$ that satisfy
Eq. (38.16). A minor detail is that the extended action sets are infinite, while
the analysis in previous sections only demonstrated existence of solutions to
the Bellman optimality equation for finite MDPs. You should convince yourself
that $C_t(s, a)$ is convex and has finitely many extremal points. Restricting the
confidence sets to these points makes the extended MDP finite without changing
the optimal policy.


Computing the Optimistic Policy


Here we explain how to efficiently solve the Bellman optimality equation for the
extended MDP. The results in Section 38.3 show that the Bellman optimality
equation for $\tilde{M}_k$ can be solved efficiently provided that for any value function
$v \in \mathbb{R}^S$ computing

\[
\operatorname{argmax}_{a \in \mathcal{A}} \left(r_a(s) + \max_{P \in C_{\tau_k}(s, a)} \langle P, v \rangle\right) \tag{38.17}
\]

can be carried out in an efficient manner. The inner optimisation is another linear
program with $S$ variables and $O(S)$ constraints and can be solved in polynomial
time. This procedure is repeated for each $a \in \mathcal{A}$ to compute the outcome of
(38.17). In fact the inner optimisation can be solved more straightforwardly by
sorting the entries of $v$ and then allocating $P$ coordinate by coordinate to be as
large as allowed by the constraints in decreasing order of $v$. The total computation
cost of solving Eq. (38.17) in this way is $O(S(A + \log S))$. Combining this with
Algorithm 27 gives the required separation oracle.
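
One common way to realise the sorting idea for the inner optimisation is sketched below: move as much probability mass as the $\ell_1$ budget allows onto the state with the largest value, and take it away from the states with the smallest values. This is an illustrative variant under stated assumptions, not necessarily the exact procedure the text has in mind; the function name and the example numbers are hypothetical.

```python
def optimistic_transition(p_hat, v, radius):
    """Maximise <p, v> over probability vectors p with ||p - p_hat||_1 <= radius."""
    order = sorted(range(len(v)), key=lambda x: v[x], reverse=True)
    p = list(p_hat)
    best = order[0]
    # Moving mass m between two states changes the l1 distance by 2m.
    p[best] = min(1.0, p_hat[best] + radius / 2)
    excess = sum(p) - 1.0
    for x in reversed(order):          # least valuable states lose mass first
        if excess <= 0:
            break
        if x == best:
            continue
        removed = min(p[x], excess)
        p[x] -= removed
        excess -= removed
    return p

p_hat = [0.5, 0.3, 0.2]
v = [0.0, 1.0, 2.0]
p_opt = optimistic_transition(p_hat, v, radius=0.4)
l1_distance = sum(abs(a - b) for a, b in zip(p_opt, p_hat))
```

In the example, 0.2 units of mass move from the state with value 0 to the state with value 2, which uses up the whole budget of 0.4 and raises $\langle P, v \rangle$ from 0.7 to 1.1.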

The next problem is to find an $R$ such that the set of feasible solutions to the
linear programs in Eq. (38.6) and Eq. (38.7) are contained in the set $\{x : \|x\| \leq R\}$.
As discussed in Section 38.3.1, a suitable value is $R = \sqrt{1 + D^2 S}$, where $D$ is
an upper bound on the diameter of the MDP. It turns out that $D = \sqrt{n}$ works
because for each pair of states $s, s'$, there exists an action $a$ and $P \in C_{\tau_k}(s, a)$
such that $P(s, s') \geq 1 \wedge (1/\sqrt{n})$ so $D(\tilde{M}_k) \leq \sqrt{n}$. Combining this with the tools
developed in Section 38.3 shows that the Bellman optimality equation for $\tilde{M}_k$ may
be solved using linear programming in polynomial time. Note that the additional
constraints require a minor adaptation of the separation oracle, which we leave
to the reader.


Proof of Upper Bound 


The proof is developed in three steps. First we decompose the regret into phases 
and define a failure event where the confidence intervals fail. In the second step, 
we bound the regret in each phase, and in the third step we sum over the phases. 
Recall that $M = (\mathcal{S}, \mathcal{A}, P, r)$ is the true Markov decision process with diameter
$D = D(M)$. The initial state distribution is $\mu \in \mathcal{P}(\mathcal{S})$, which is arbitrary.


Step 1: Failure Events and Decomposition

Let $K$ be the (random) number of phases, and for $k \in [K]$, let $E_k =
\{\tau_k, \tau_k + 1, \ldots, \tau_{k+1} - 1\}$ be the set of rounds in the $k$th phase, where $\tau_{K+1}$
is defined to be $n + 1$. Let $T_{(k)}(s, a)$ be the number of times state-action pair $s, a$
is visited in the $k$th phase:

\[
T_{(k)}(s, a) = \sum_{t \in E_k} \mathbb{I}\{S_t = s, A_t = a\}.
\]

Define $F$ as the failure event that $P \notin C_{\tau_k}$ for some $k \in [K]$.

LEMMA 38.8. $\mathbb{P}(F) \leq \delta/2$.

The proof is based on a concentration inequality derived for categorical
distributions and is left for Exercise 38.21. When $F$ does not hold, the true
transition kernel is in $C_{\tau_k}$ for all $k$, which means that $\rho^* \leq \rho_k$ and

\[
\hat{R}_n = \sum_{t=1}^{n} \left(\rho^* - r_{A_t}(S_t)\right) \leq \sum_{k=1}^{K} \underbrace{\sum_{t \in E_k} \left(\rho_k - r_{A_t}(S_t)\right)}_{\tilde{R}_k}.
\]

In the next step, we bound $\tilde{R}_k$ under the assumption that $F$ does not hold.


Step 2: Bounding the Regret in Each Phase 

Assume that $F$ does not occur and fix $k \in [K]$. Recall that $v_k$ is a value function
satisfying the Bellman optimality equation in the optimistic MDP $M_k$, and $\rho_k$ is
its gain. Hence

\[
\rho_k = r_{\pi_k}(s) - v_k(s) + \langle P_{k,\pi_k}(s), v_k \rangle \quad \text{for all } s \in \mathcal{S}. \tag{38.18}
\]

As noted earlier, solutions to the Bellman optimality equation remain solutions
when translated, so we may assume without loss of generality that $v_k$ is such that
$\|v_k\|_\infty \leq \operatorname{span}(v_k)/2$, which means that

\[
\|v_k\|_\infty \leq \frac{1}{2} \operatorname{span}(v_k) \leq \frac{D}{2}, \tag{38.19}
\]

where the second inequality follows from Lemma 38.3 and the fact that when $F$
does not hold, the diameter of the extended MDP $\tilde{M}_k$ is at most $D$ and $v_k$ also
satisfies the Bellman optimality equation in this MDP. By the definition of the
policy, we have $A_t = \pi_k(S_t)$ for $t \in E_k$, which implies that


Pk = TA (Si) — up(S;) + (Pra, (St), Up) for allt € Eg. 
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Rearranging and substituting yields 


Re = X (ve (St) + (Pe, ae(St); Ue) 
tE Ek 
= =5_ —vr( (S) 4 + (Pa, (St), ve)) EJ XO (Pr, a: (St) — P4, (St), vk) 
tC Ey te Ey; 
< So (-ve(S:) + + (PaCS) o) + 5 XO [Pk a(S) — Pa, (Solla; (38.20) 
tEEk ` tEEk 


(A) (B) 


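where the inequality follows from Hölder’s inequality and Eq. (38.19). Let
$\mathbb{E}_t[\cdot]$ denote the conditional expectation with respect to $\mathbb{P}$ conditioned on
$\sigma(S_1, A_1, \ldots, S_{t-1}, A_{t-1}, S_t)$. To bound (A), we reorder the terms and use the
fact that $\operatorname{span}(v_k) \le D$ on the event $F^c$. We get

$$\begin{aligned}
(A) &= \sum_{t \in E_k} \bigl(v_k(S_{t+1}) - v_k(S_t) + \langle P_{A_t}(S_t), v_k \rangle - v_k(S_{t+1})\bigr) \\
&= v_k(S_{\tau_{k+1}}) - v_k(S_{\tau_k}) + \sum_{t \in E_k} \bigl(\langle P_{A_t}(S_t), v_k \rangle - v_k(S_{t+1})\bigr) \\
&\le D + \sum_{t \in E_k} \bigl(\mathbb{E}_t[v_k(S_{t+1})] - v_k(S_{t+1})\bigr)\,,
\end{aligned}$$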


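where the second equality used that $\max E_k = \tau_{k+1} - 1$ and $\min E_k = \tau_k$. We
leave this here for now and move on to term (B) in Eq. (38.20). The definition of
the confidence intervals and the assumption that $F$ does not occur shows that

$$(B) \le D\sqrt{LS} \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \frac{T_{(k)}(s, a)}{\sqrt{1 \vee T_{\tau_k - 1}(s, a)}}\,.$$

Combining the bounds (A) and (B) yields

$$\tilde R_k \le D + \sum_{t \in E_k} \bigl(\mathbb{E}_t[v_k(S_{t+1})] - v_k(S_{t+1})\bigr) + D\sqrt{LS} \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \frac{T_{(k)}(s, a)}{\sqrt{1 \vee T_{\tau_k - 1}(s, a)}}\,.$$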


Step 3: Bounding the Number of Phases and Summing 
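Let $k_t$ be the phase in round $t$ so that $t \in E_{k_t}$. By the work in the previous two
steps, if $F$ does not occur, then

$$\hat R_n \le KD + \sum_{t=1}^{n} \bigl(\mathbb{E}_t[v_{k_t}(S_{t+1})] - v_{k_t}(S_{t+1})\bigr) + D\sqrt{LS} \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \sum_{k=1}^{K} \frac{T_{(k)}(s, a)}{\sqrt{1 \vee T_{\tau_k - 1}(s, a)}}\,.$$

The first sum is bounded using a version of Hoeffding-Azuma (Exercise 20.7):

$$\mathbb{P}\left(F^c \text{ and } \sum_{t=1}^{n} \bigl(\mathbb{E}_t[v_{k_t}(S_{t+1})] - v_{k_t}(S_{t+1})\bigr) \ge D\sqrt{\frac{n \log(2/\delta)}{2}}\right) \le \frac{\delta}{2}\,.$$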


For the second term, we note that $T_{(k)}(s, a)/\sqrt{1 \vee T_{\tau_k - 1}(s, a)}$ cannot be large
too often. A continuous approximation often provides intuition for the correct
form. Recalling the thousands of integrals you did at school, for any differentiable
$f : [0, \infty) \to \mathbb{R}$,

$$\int_0^K \frac{f'(k)}{\sqrt{f(k)}}\, dk = 2\sqrt{f(K)} - 2\sqrt{f(0)}\,. \tag{38.21}$$

Here we are thinking of $f(k)$ as the continuous approximation of $T_{\tau_k - 1}(s, a)$ and
its derivative as $T_{(k)}(s, a)$. In Exercise 38.22, we ask you to make this argument
rigorous by showing that

$$\sum_{k=1}^{K} \frac{T_{(k)}(s, a)}{\sqrt{1 \vee T_{\tau_k - 1}(s, a)}} \le \bigl(\sqrt{2} + 1\bigr) \sqrt{T_n(s, a)}\,.$$

Then by Cauchy-Schwarz and the fact that $\sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} T_n(s, a) = n$,

$$\sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \sqrt{T_n(s, a)} \le \sqrt{SAn}\,.$$
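Both inequalities are easy to probe numerically. The following is a minimal sketch and not part of the proof: the function names and the synthetic visit counts are assumptions made for illustration, and the per-phase counts are generated so that a counter never more than doubles within a phase, which is the property the phase schedule is assumed to guarantee.

```python
import math
import random

def phase_sum(phase_counts):
    """Return (sum_k z_k / sqrt(max(1, Z_{k-1})), Z_K) for per-phase counts z_k."""
    total_before = 0          # plays the role of T_{tau_k - 1}(s, a)
    lhs = 0.0
    for z in phase_counts:    # z plays the role of T_{(k)}(s, a)
        lhs += z / math.sqrt(max(1, total_before))
        total_before += z
    return lhs, total_before

def random_phase_counts(num_phases, rng):
    """Synthetic counts with 0 <= z_k <= max(1, Z_{k-1}) (assumed doubling rule)."""
    counts, total = [], 0
    for _ in range(num_phases):
        z = rng.randint(0, max(1, total))
        counts.append(z)
        total += z
    return counts

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(2000):
    lhs, t_n = phase_sum(random_phase_counts(rng.randint(1, 30), rng))
    if t_n > 0:
        worst_ratio = max(worst_ratio, lhs / math.sqrt(t_n))

# Cauchy-Schwarz step: the sum of sqrt(T_n(s, a)) is at most sqrt(S * A * n).
visits = [rng.randint(0, 50) for _ in range(12)]   # 12 hypothetical state-action pairs
n = sum(visits)
cs_lhs = sum(math.sqrt(v) for v in visits)
cs_rhs = math.sqrt(len(visits) * n)
```

In this sketch `worst_ratio` is the largest observed value of the left-hand side divided by $\sqrt{T_n(s,a)}$, which should never exceed $\sqrt{2} + 1$.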


It remains to bound the number of phases. A new phase starts when the visit
count for some state-action pair doubles. Hence $K$ cannot be more than the
number of times the counters double in total over all state-action pairs. It is easy to
see that $1 + \log_2 T_n(s, a)$ gives an upper bound on how many times the counter
for this pair may double (the constant 1 is there to account for the counter
changing from zero to one). Thus $K \le K' = \sum_{s,a} \bigl(1 + \log_2 T_n(s, a)\bigr)$. Noting that
$0 \le T_n(s, a)$ and $\sum_{s,a} T_n(s, a) = n$ and relaxing $T_n(s, a)$ to take real values, we
find that the value of $K'$ is the largest when $T_n(s, a) = n/(SA)$, which shows that

$$K \le SA \left(1 + \log_2\left(\frac{n}{SA}\right)\right)\,.$$


Putting everything together gives the desired result. 
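The two counting facts used for the number of phases can be checked on small numbers. This is a hedged sketch with hypothetical names: `doubling_count` counts how many of the thresholds 1, 2, 4, ... a single counter has reached, and the loop compares $K'$ for random splits of $n$ visits against the balanced split $T_n(s,a) = n/(SA)$.

```python
import math
import random

def doubling_count(final_count):
    """Number of thresholds 1, 2, 4, ... reached by a counter ending at final_count."""
    count, threshold = 0, 1
    while threshold <= final_count:
        count += 1
        threshold *= 2
    return count

def phase_bound(visits):
    """K' = sum over visited pairs of 1 + log2 T_n(s, a)."""
    return sum(1 + math.log2(v) for v in visits if v > 0)

rng = random.Random(1)
num_pairs, n = 6, 600                     # hypothetical S * A and horizon
balanced = num_pairs * (1 + math.log2(n / num_pairs))

largest = 0.0
for _ in range(1000):
    # Random split of n visits over the pairs, each pair visited at least once.
    cuts = sorted(rng.sample(range(1, n), num_pairs - 1))
    visits = [b - a for a, b in zip([0] + cuts, cuts + [n])]
    largest = max(largest, phase_bound(visits))
```

The comparison `largest <= balanced` is an instance of the concavity of the logarithm, which is what the relaxation to real values exploits.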


Proof of Lower Bound 


The lower bound is proven by crafting a difficult MDP that models a bandit
with approximately $SA$ arms. This is a cumbersome endeavour, but intuitively
straightforward, and the explanations that follow should be made clear by Fig. 38.3.
Given $S$ and $A$, the first step is to construct a tree of minimum depth with at
most $A$ children for each node using exactly $S - 2$ states. The root of the tree is
denoted by $s_\circ$ and transitions within the tree are deterministic, so in any given
node, the learner can simply select which child to transition to. Let $L$ be the
number of leaves, and label these states $s_1, \ldots, s_L$. The last two states are $s_g$
and $s_b$ (‘good’ and ‘bad’ respectively). For each $i \in [L]$, the learner can take any
action $a \in \mathcal{A}$ and transitions to either the good state or the bad state according
to

$$P_a(s_i, s_g) = \frac{1}{2} + \varepsilon(a, i) \quad \text{and} \quad P_a(s_i, s_b) = \frac{1}{2} - \varepsilon(a, i)\,.$$

The function $\varepsilon$ will be chosen so that $\varepsilon(a, i) = 0$ for all $(a, i)$ pairs except one. For
this special state-action pair, we let $\varepsilon(a, i) = \Delta$ for appropriately tuned $\Delta > 0$.
The good state and the bad state have the same transitions for all actions:

$$\begin{aligned}
P_a(s_g, s_g) &= 1 - \delta\,, & P_a(s_g, s_\circ) &= \delta\,; \\
P_a(s_b, s_b) &= 1 - \delta\,, & P_a(s_b, s_\circ) &= \delta\,.
\end{aligned}$$

Choosing $\delta = 4/D$, which under the assumptions of the theorem is guaranteed
to be in $(0, 1]$, ensures that the diameter of the described MDP is at most $D$,
regardless of the value of $\Delta$. The reward function is $r_a(s) = 1$ if $s = s_g$ and
$r_a(s) = 0$ otherwise.

The connection to finite-armed bandits is straightforward. Each time the learner
arrives in state $s_\circ$, it selects which leaf to visit and then chooses an action from
that leaf. This corresponds to choosing one of $k = LA = \Omega(SA)$ meta actions.
The optimal policy is to select the meta action with the largest probability of
transitioning to the good state. The choice of $\delta$ means the learner expects to
stay in the good/bad state for approximately $D$ rounds, which also makes the
diameter of this MDP about $D$. This means the learner expects to make about
$n/D$ decisions and the rewards are roughly in $[0, D]$, so we should expect the
regret to be $\Omega(D\sqrt{kn/D}) = \Omega(\sqrt{nDSA})$.
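To make the role of $\delta$ concrete, the sketch below computes one travel time in the construction by solving a linear system. It is an illustration under stated assumptions rather than a proof of the diameter claim: it takes $\varepsilon = 0$, a tree of depth 2 (as for $A = 2$ and $S = 8$), a learner that always follows the same root-to-leaf path, and a hypothetical value $D = 20$. The function name is made up for this example.

```python
import numpy as np

def bad_to_good_time(diameter_budget, depth=2):
    """Expected time to reach the good state from the bad state when eps = 0.

    The chain follows one fixed root-to-leaf path with `depth` edges, an
    assumption matching the A = 2, S = 8 instance.
    """
    delta = 4.0 / diameter_budget
    path_len = depth + 1                  # root, internal nodes, leaf
    bad = path_len                        # index of the bad state
    size = path_len + 1                   # non-target states: path plus bad state
    Q = np.zeros((size, size))            # transitions among non-target states
    for i in range(path_len - 1):
        Q[i, i + 1] = 1.0                 # deterministic moves down the tree
    Q[path_len - 1, bad] = 0.5            # leaf -> bad w.p. 1/2 (otherwise good)
    Q[bad, bad] = 1.0 - delta             # remain in the bad state
    Q[bad, 0] = delta                     # return to the root
    # Expected hitting times of the good state solve h = 1 + Q h.
    h = np.linalg.solve(np.eye(size) - Q, np.ones(size))
    return h[bad], delta

D = 20.0
time_bad_to_good, delta = bad_to_good_time(D)
closed_form = 2.0 / delta + 2.0 * 3       # 2 / delta + 2 * (depth + 1)
```

With these numbers the expected sojourn in the good or bad state is $1/\delta = D/4$, and the trip from the bad state to the good state takes $2/\delta + 2(\text{depth} + 1)$ rounds in expectation, which stays below $D$ here.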


Figure 38.3 Lower-bound construction for $A = 2$ and $S = 8$. The resulting MDP is
roughly equivalent to a bandit with six actions. (Edge labels in the figure are
probability, reward pairs: $\delta, 1$ and $1 - \delta, 1$ at the good state; $\delta, 0$ and $1 - \delta, 0$ at the bad state.)

One could almost claim victory here and not bother with the proof. As usual,
however, there are some technical difficulties, which in this case arise because the
number of visits to the decision state $s_\circ$ is a random quantity. For this reason we
give the proof, leaving as exercises the parts that are both obvious and annoying.


Proof of Theorem 38.7 The proof follows the path suggested in Exercise 15.2.
We break things up into two steps. Throughout we fix an arbitrary policy $\pi$.


Step 1: Notation and Facts about the MDP 

Let $d$ be the depth of the tree in the MDP construction, $L$ the number of leaves
and $k = LA$. Define the set of state-action pairs for which the state is a leaf of
the tree by

$$\mathcal{L} = \{(s, a) : a \in \mathcal{A} \text{ and } s \text{ is a leaf of the tree}\}\,.$$

By definition, this has $k$ elements. Let $M_0$ be the MDP with $\varepsilon(s, a) = 0$ for all
$(s, a) \in \mathcal{L}$. Then let $M_j$ be the MDP with $\varepsilon(s, a) = \Delta$ for the $j$th state-action
pair in the above set. Define stopping time $\tau$ by

$$\tau = n \wedge \min\left\{t : \sum_{u=1}^{t} \mathbb{I}\{S_u = s_\circ\} \ge \frac{n}{D} - 1\right\}\,,$$

which is the first round when the number of visits to state $s_\circ$ is at least $n/D - 1$,
or $n$ if $s_\circ$ is visited fewer times than $n/D$. Next, let $T_j$ be the number of visits to
state-action pair $j \in [k]$ until stopping time $\tau$ and $T_\sigma = \sum_{j=1}^{k} T_j$. For $0 \le j \le k$,
let $\mathbb{P}_j$ be the law of $T_1, \ldots, T_k$ induced by the interaction of $\pi$ and $M_j$. And
let $\mathbb{E}_j[\cdot]$ be the expectation with respect to $\mathbb{P}_j$. None of the following claims is
surprising, but they are all tiresome to prove to some extent. The claims are
listed in increasing order of difficulty and left to the reader in Exercise 38.24.


CLAIM 38.9. For all $j \in [k]$, the diameter is bounded by $D(M_j) \le D$.


CLAIM 38.10. There exist universal constants $0 < c_1 < c_2 < \infty$ such that

$$D\, \mathbb{E}_0[T_\sigma]/n \in [c_1, c_2]\,.$$


CLAIM 38.11. Let $R_{nj}$ be the expected regret of policy $\pi$ in MDP $M_j$ over $n$
rounds. There exists a universal constant $c_3 > 0$ such that

$$R_{nj} \ge c_3 \Delta D\, \mathbb{E}_j[T_\sigma - T_j]\,.$$


Step 2: Bounding the Regret 

Notice that $M_0$ and $M_j$ only differ when state-action pair $j$ is visited. In
Exercise 38.30, you are invited to use this fact and the chain rule for relative
entropy given in Exercise 14.13 to prove that

$$\mathrm{D}(\mathbb{P}_0, \mathbb{P}_j) = \mathbb{E}_0[T_j]\, d(1/2, 1/2 + \Delta)\,, \tag{38.22}$$

where $d(p, q)$ is the relative entropy between Bernoulli distributions with means
$p$ and $q$, respectively. Now $\Delta$ will be chosen to satisfy $\Delta \le 1/4$. It follows from
the entropy inequalities in Eq. (14.16) that

$$\mathrm{D}(\mathbb{P}_0, \mathbb{P}_j) \le 4\Delta^2\, \mathbb{E}_0[T_j]\,. \tag{38.23}$$
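The Bernoulli relative entropy bound behind Eq. (38.23) can be checked directly. The snippet below is a small sketch (the helper name `bernoulli_kl` is an assumption) that evaluates $d(1/2, 1/2 + \Delta)/(4\Delta^2)$ on a grid of $\Delta \in (0, 1/4]$ and compares one value against the closed form $d(1/2, 1/2 + \Delta) = -\tfrac{1}{2}\log(1 - 4\Delta^2)$.

```python
import math

def bernoulli_kl(p, q):
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

gaps = [i / 1000 for i in range(1, 251)]           # Delta in (0, 1/4]
ratios = [bernoulli_kl(0.5, 0.5 + g) / (4 * g * g) for g in gaps]
largest_ratio = max(ratios)                        # stays below 1 on this grid
```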


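Using the fact that $0 \le T_\sigma - T_j \le T_\sigma \le n/D$, Exercise 14.4, Pinsker’s
inequality (Eq. (14.12)) and (38.23),

$$\mathbb{E}_j[T_\sigma - T_j] \ge \mathbb{E}_0[T_\sigma - T_j] - \frac{n}{D} \sqrt{\frac{\mathrm{D}(\mathbb{P}_0, \mathbb{P}_j)}{2}} \ge \mathbb{E}_0[T_\sigma - T_j] - \frac{n\Delta}{D} \sqrt{2\, \mathbb{E}_0[T_j]}\,.$$

Summing over $j$ and applying Cauchy-Schwarz yields

$$\begin{aligned}
\sum_{j=1}^{k} \mathbb{E}_j[T_\sigma - T_j] &\ge \sum_{j=1}^{k} \mathbb{E}_0[T_\sigma - T_j] - \frac{n\Delta}{D} \sum_{j=1}^{k} \sqrt{2\, \mathbb{E}_0[T_j]} \\
&\ge (k - 1)\, \mathbb{E}_0[T_\sigma] - \frac{n\Delta}{D} \sqrt{2k\, \mathbb{E}_0[T_\sigma]} \\
&\ge \frac{c_1 n (k - 1)}{D} - \frac{n\Delta}{D} \sqrt{\frac{2 c_2 n k}{D}} \\
&\ge \frac{c_1 n (k - 1)}{2D}\,,
\end{aligned} \tag{38.24}$$

where the last inequality follows by choosing

$$\Delta = \frac{c_1 (k - 1)}{2} \sqrt{\frac{D}{2 c_2 n k}}\,.$$

By Eq. (38.24), there exists a $j \in [k]$ such that

$$\mathbb{E}_j[T_\sigma - T_j] \ge \frac{c_1 n (k - 1)}{2Dk}\,.$$

Then, for the last step, apply Claim 38.11 to show that

$$R_{nj} \ge c_3 D \Delta\, \mathbb{E}_j[T_\sigma - T_j] \ge \frac{c_1^2 c_3 n (k - 1)^2}{4k} \sqrt{\frac{D}{2 c_2 n k}}\,.$$

Naive bounding and simplification concludes the proof.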
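One way to carry out the final simplification is sketched below. It assumes $k \ge 2$, so that $(k - 1)^2 \ge k^2/4$, and uses $k = LA = \Omega(SA)$ from the construction; the constant is not optimised, and the requirement $\Delta \le 1/4$ is met once $n$ is sufficiently large relative to $Dk$.

```latex
\begin{align*}
R_{nj}
&\ge \frac{c_1^2 c_3\, n (k-1)^2}{4k} \sqrt{\frac{D}{2 c_2 n k}}
 = \frac{c_1^2 c_3 (k-1)^2}{4k} \sqrt{\frac{nD}{2 c_2 k}} \\
&\ge \frac{c_1^2 c_3\, k}{16} \sqrt{\frac{nD}{2 c_2 k}}
 = \frac{c_1^2 c_3}{16 \sqrt{2 c_2}} \sqrt{nDk}
 = \Omega\bigl(\sqrt{nDSA}\bigr)\,.
\end{align*}
```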


Notes 


1 MDPs in applications can have millions (or ‘billions and billions’) of states, 
which should make the reader worried that the bound in Theorem 38.6 could 
be extremely large. The takeaway should be that learning in large MDPs 
without additional assumptions is hard, as attested by the lower bound in 
Theorem 38.7.

2 The key to choosing the state space is that the state must be observable and
sufficiently informative that the Markov property is satisfied. Blowing up the 
size of the state space may help to increase the fidelity of the approximation 
(the entire history always works), but will almost always slow down learning. 

3 We simplified the definition of MDPs by making the rewards a deterministic 
function of the current state and the action chosen. A more general definition 
allows the rewards to evolve in a random fashion, jointly with the next state. 
In this definition, the mean reward functions are dropped and the transition
kernel $P_a$ is replaced with an $\mathcal{S} \to \mathcal{S} \times \mathbb{R}$ stochastic kernel, call it $\tilde P_a$. Thus,
for every $s \in \mathcal{S}$, $\tilde P_a(s)$ is a probability measure over $\mathcal{S} \times \mathbb{R}$. The meaning of this
is that when action $a$ is chosen in state $s$, a random transition $(S, R) \sim \tilde P_a(s)$
happens to state $S$, while reward $R$ is received. Note that the mean reward
along this transition is $r_a(s) = \int x\, \tilde P_a(s, ds', dx)$.

4 A state $s \in \mathcal{S}$ is absorbing if $P_a(s, s) = 1$ for all $a \in \mathcal{A}$. An MDP is episodic if
there exists an absorbing state that is reached almost surely by any policy. The 
average reward criterion is meaningless in episodic MDPs because all policies 
are optimal. In this case the usual objective is to maximise the expected reward 
until the absorbing state is reached without limits or normalisation, sometimes 
with discounting. An MDP is finite-horizon if it is episodic and the absorbing 
state is always reached after some fixed number of rounds. The simplification 
of the setting eases the analysis and preserves most of the intuition from the 
general setting. 

5 A partially observable MDP (POMDP) is a generalisation where the learner 
does not observe the underlying state. Instead they receive an observation 
that is a (possibly random) function of the state. Given a fixed (known) initial 
state distribution, any POMDP can be mapped to an MDP at the price of 
enlarging the state space. A simple way to achieve this is to let the new state 
space be the space of all histories. Alternatively you can use any sufficient 
statistic for the hidden state as the state. A natural choice is the posterior 
distribution over the hidden state given the interaction history, which is called 
the belief space. While the value function over the belief space has some nice 
structure, in general even computing the optimal policy is hard [Papadimitriou 
and Tsitsiklis, 1987]. 

6 We called the all-knowing entity that interacts with the MDP an agent. In 
operations research the term is decision maker and in control theory it is 
controller. In control theory the environment would be called the controlled 
system or the plant (for power-plant, not a biological plant). Acting in 
an MDP is studied in control theory under stochastic optimal control, 
while in operations research the area is called multistage decision making 
under uncertainty or multistage stochastic programming. In the control 
community the infinite horizon setting with the average cost criterion is perhaps 
the most common, while in operations research the episodic setting is typical. 

7 The definition of the optimal gain that is appropriate for MDPs that are not
strongly connected is a vector $\rho^* \in \mathbb{R}^S$ given by $\rho^*_s = \sup_\pi \rho^\pi_s$. A policy is
optimal if it achieves the supremum in this definition and such a policy always
exists as long as the MDP is finite. In strongly connected MDPs, the two
definitions coincide. For infinite MDPs, everything becomes more delicate and
a large portion of the literature on MDPs is devoted to this case.

8 In applications where the asymptotic nature of gain optimality is unacceptable,
there are criteria that make finer distinctions between the policies. A memoryless
policy $\pi^*$ is bias optimal if it is gain optimal and $v_{\pi^*} \ge v_\pi$ for all memoryless
policies $\pi$. Even more sensitive criteria exist. Some keywords to search for are
Blackwell optimality and $n$-discount optimality.

9 The Cesàro sum of a real-valued sequence $(a_n)_n$ is the asymptotic average of
its partial sums. Let $s_n = a_0 + \cdots + a_{n-1}$ be the $n$th partial sum. The Cesàro
sum of this sequence is $A = \lim_{n \to \infty} \frac{1}{n}(s_1 + \cdots + s_n)$ when this limit exists.
The idea is that Cesàro summation smoothes out periodicity, which means that
for certain sequences the Cesàro sum exists while $s_n$ does not converge. For
example, the alternating sequence $(+1, -1, +1, -1, \ldots)$ is Cesàro summable,
and its Cesàro sum is easily seen to be 1/2, while it is not summable in the
normal sense. If a sequence is summable, then its sum and its Cesàro sum
coincide. The differential value of a policy is defined as a Cesàro sum so that it
is well defined even if the underlying Markov chain has periodic states.
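The alternating example can be reproduced in a few lines. This is an illustrative sketch with a hypothetical helper name; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import accumulate

def cesaro_means(terms):
    """Running averages (s_1 + ... + s_n) / n of the partial sums s_n."""
    partial_sums = list(accumulate(terms))             # s_1, s_2, ...
    running_totals = accumulate(partial_sums)          # s_1 + ... + s_n
    return [Fraction(total, n) for n, total in enumerate(running_totals, start=1)]

alternating = [(-1) ** i for i in range(1000)]         # +1, -1, +1, -1, ...
means = cesaro_means(alternating)
```

The partial sums oscillate between 1 and 0 and so do not converge, while the running averages in `means` settle at 1/2.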

10 For $\gamma \in (0, 1)$, the $\gamma$-discounted average of sequence $(a_n)_n$ is $A_\gamma = (1 - \gamma) \sum_{n=0}^{\infty} \gamma^n a_n$.
An elementary argument shows that if $A_\gamma$ is well defined, then
$A_\gamma = (1 - \gamma)^2 \sum_{n=1}^{\infty} \gamma^{n-1} s_n$. Suppose the Cesàro sum $A = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} s_t$
exists, then using the fact that $1 = (1 - \gamma)^2 \sum_{n=1}^{\infty} \gamma^{n-1} n$, we have $A_\gamma - A =
(1 - \gamma)^2 \sum_{n=1}^{\infty} \gamma^{n-1} (s_n - nA)$. It is not hard to see that $\bigl|\sum_{n=1}^{\infty} \gamma^{n-1} (s_n - nA)\bigr| =
O(1/(1 - \gamma))$, and thus $A_\gamma - A = O(1 - \gamma)$ as $\gamma \to 1$, which means that
$\lim_{\gamma \to 1} A_\gamma = A$. The value $\lim_{\gamma \to 1} A_\gamma$ is called the Abel sum of $(a_n)_n$. Put
simply, the Abel sum of a sequence is equal to its Cesàro sum when the latter
exists. Abel summation is stronger in the sense that there are sequences that
are Abel summable but not Cesàro summable. The approach of approximating
Cesàro sums through $\gamma$-discounted averages, and taking the limit as $\gamma \to 1$ is
called the vanishing discount approach and is one of the standard ways
to prove that the (average reward) Bellman equation has a solution (see
Exercises 38.9 and 38.10). As an aside, the systematic study of how to define
the ‘sum’ of a divergent series is a relatively modern endeavour. An enjoyable
historical account is given in the first chapter of the book on the topic by
Hardy [1973].

11 Given a solution $(\rho, v)$ to Eq. (38.6), we mentioned a procedure for finding
a state $\tilde s \in \mathcal{S}$ that is recurrent under some optimal policy. This works as
follows. Let $C_0 = \{(s, a) : \rho + v(s) = r_a(s) + \langle P_a(s), v \rangle\}$ and $I_0 = \{s :
(s, a) \in C_0 \text{ for some } a \in \mathcal{A}\}$. Then define $C_{k+1}$ and $I_{k+1}$ inductively by the
following algorithm. First find an $(s, a) \in C_k$ such that $P_a(s, s') > 0$ for some
$s' \notin I_k$. If no such pair exists, then halt. Otherwise let $C_{k+1} = C_k \setminus \{(s, a)\}$
and $I_{k+1} = \{s : (s, a) \in C_{k+1} \text{ for some } a \in \mathcal{A}\}$. Now use the complementary
slackness conditions of the dual program to Eq. (38.6) to prove that the
algorithm halts with some non-empty $I_k$ and that these states are recurrent
under some optimal policy. For more details, have a look at Exercise 4.15 of
the second volume of the book by Bertsekas [2012].

12 We mentioned enumeration, value iteration and policy iteration as other
methods for computing optimal policies. Enumeration just means enumerating
all deterministic memoryless policies and selecting the one with the highest
gain. This is obviously too expensive. Policy iteration is an iterative process
that starts with a policy $\pi_0$. In each round, the algorithm computes $\pi_{t+1}$
from $\pi_t$ by computing $v_{\pi_t}$ and then choosing $\pi_{t+1}$ to be the greedy policy
with respect to $v_{\pi_t}$. This method may not converge to an optimal policy,
but by slightly modifying the update process, one can prove convergence.
For more details, see chapter 4 of volume 2 of the book by Bertsekas [2012].
Value iteration works by choosing an arbitrary value function $v_0$ and then
inductively defining $v_{k+1} = T v_k$, where $(Tv)(s) = \max_{a \in \mathcal{A}} r_a(s) + \langle P_a(s), v \rangle$
is the Bellman operator. Under certain technical conditions, one can prove
that the greedy policy with respect to $v_k$ converges to an optimal policy. Note
that $v_{k+1} = \Omega(k)$, which can be a problem numerically. A simple idea is to
let $v_{k+1} = T v_k - \delta_k$ where $\delta_k = \max_{s \in \mathcal{S}} v_k(s)$. Since the greedy policy is the
same for $v$ and $v + c\mathbf{1}$, this does not change the mathematics, but improves
the numerical situation. The aforementioned book by Bertsekas is again a
good source for more details. Unfortunately, none of these algorithms have
known polynomial time guarantees on the computational complexity of finding
an optimal policy without stronger assumptions than we would like. In practice,
however, both value and policy iteration work quite well, while the ellipsoid
method for solving linear programs should be avoided at all costs. Of course
there are other methods for solving linear programs, and these can be effective.
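The normalised value iteration just described can be sketched on a toy problem. Everything concrete below is an assumption made for illustration: the two-state, two-action MDP, the array layout `P[a, s, s']` and `r[a, s]`, the iteration count, and the way the gain is read off from $Tv - v$. The example is chosen so that every policy induces an aperiodic chain, which is one setting where the iteration is expected to converge.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'] and r[a, s].
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # action 0: mostly stay in the current state
    [[0.1, 0.9], [0.9, 0.1]],   # action 1: mostly switch state
])
r = np.array([
    [0.0, 1.0],                 # action 0
    [0.0, 0.5],                 # action 1
])

def bellman(v):
    """Return (T v, greedy actions), where q[a, s] = r_a(s) + <P_a(s), v>."""
    q = r + P @ v
    return q.max(axis=0), q.argmax(axis=0)

def relative_value_iteration(num_iters=500):
    v = np.zeros(2)
    for _ in range(num_iters):
        tv, _ = bellman(v)
        v = tv - v.max()        # v_{k+1} = T v_k - delta_k with delta_k = max_s v_k(s)
    tv, policy = bellman(v)
    diff = tv - v               # min and max of T v - v bracket the optimal gain
    gain = 0.5 * (diff.max() + diff.min())
    return v, policy, gain

v, policy, gain = relative_value_iteration()
```

For this instance the greedy policy moves from state 0 to state 1 and then stays there, giving a gain of 0.9, and the iterates remain bounded instead of growing linearly with the number of iterations.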

13 Theorem 38.6 is vacuous when the diameter is infinite, but you might wonder if
the bound continues to hold in certain ‘nice’ cases. Unfortunately, the algorithm
is rather brittle. UCRL2 suffers linear regret if there is a single unreachable
state with reward larger than the optimal gain (Exercise 38.27).

14 One can modify the concept of regret to allow for MDPs that have traps. We
restrict our attention to policies with sublinear regret in strongly connected 
MDPs, which must try and explore the whole state space and hence almost 
surely become trapped in a strongly communicating subset of the state space. 
The regret is redefined by ‘restarting the clock’ at the time when the policy 
gets trapped. For details, see Exercise 38.29. 

15 The assumption that the reward function is known can be relaxed without
difficulty. It is left as an exercise to figure out how to modify the algorithm and
analysis to the case when $r$ is unknown and the reward observed in round $t$ is
bounded in $[0, 1]$ and has conditional mean $r_{A_t}(S_t)$. See Exercise 38.23.

16 Although it has not been done yet in this setting, the path to removing the
spurious $\sqrt{S}$ from the bound is to avoid the application of Cauchy-Schwarz
in Eq. (38.20). Instead one should define confidence intervals directly on
$\langle \hat P_k - P, v_k \rangle$, where the dependence on the state and action has been omitted.
Of course, the algorithm must be changed to use the improved confidence
intervals. At first sight, it seems that one could apply Hoeffding’s bound
directly to the inner product, but there is a subtle problem that has spoiled a
number of attempts: $v_k$ and $\hat P_k$ are not independent. This non-independence
is unfortunately quite pernicious and appears from many angles. We advise
extreme caution. Some references for guidance are given in the bibliographic
remarks.


Bibliographical Remarks 


The study of sequential decision-making has a long history, and we recommend
the introduction of the book by Puterman [2009] as a good starting point.
One of the main architects in modern times is Richard Bellman, who wrote an
influential book [Bellman, 1954]. His autobiography is so entertaining that
reading it slowed the writing of this chapter: The Eye of the Hurricane
[Bellman, 1984]. As a curiosity, Bellman knew about bandit problems after
accidentally encountering a paper by Thompson [1935]. For the tidbit, see
page 260 of the aforementioned biography.

[Portrait: Richard Bellman]

MDPs are studied by multiple research communities, including control,
operations research and artificial intelligence. The two-volume book by
Bertsekas [2012] provides a thorough and
formal introduction to the basics. The perspective is quite interdisciplinary,
but with a slight (good) bias towards the control literature. The perspective
of an operations researcher is most precisely conveyed in the comprehensive
book by Puterman [2009]. A very readable shorter introductory book is by Ross
[1983]. Arapostathis et al. [1993] surveyed existing analytical results (existence,
uniqueness of optimal policies, validity of the Bellman optimality equation) for
average-reward MDPs with an emphasis on continuous state and action space
models. The online lecture notes of Kallenberg [2016] are a recent comprehensive
alternative account for the theory of discrete MDPs. There are many texts on
linear/convex optimisation and the ellipsoid method. The introductory book on
linear optimisation by Bertsimas and Tsitsiklis [1997] is a pleasant read, while
the ellipsoid method is explained in detail by Grötschel et al. [2012].

The problem considered in this chapter is part of a broader field called 
reinforcement learning (RL), which has recently seen a surge of interest. The 
books by Sutton and Barto [2018] and Bertsekas and Tsitsiklis [1996] describe 
the foundations. The first book provides an intuitive introduction aimed at 
computer scientists, while the second book focuses on the theoretical results of 
the fundamental algorithms. A book by one of the present authors focuses on
cataloguing the range of learning problems encountered in reinforcement learning
and summarising the basic ideas and algorithms [Szepesvári, 2010].

The UCRL algorithm and the upper and lower regret analyses are due to
Auer et al. [2009] and Jaksch et al. [2010]. Our proofs differ in minor ways. A
more significant difference is that these works used value iteration for finding 
the optimistic policy and hence cannot provide polynomial time computation 
guarantees. In practice this may be preferable to linear programming anyway. 

The number of rigorous results for bounding the regret of various algorithms is
limited. One idea is to replace the optimistic approach with Thompson sampling,
which was first adapted to reinforcement learning by Strens [2000] under the
name PSRL (posterior sampling reinforcement learning). Agrawal and Jia [2017]
recently made an attempt to improve the dependence of the regret on the state
space. The proof is not quite correct, however, and at the time of writing the
holes have not yet been patched. Azar et al. [2017] also improve upon the UCRL2
bound, but for finite-horizon episodic problems, where they derive an optimistic
algorithm with regret $O(\sqrt{HSAn})$, which after adapting UCRL to the episodic
setting improves on its regret by a factor of $\sqrt{SH}$. The main innovation is to
use Freedman’s Bernstein-style inequality for computing bonuses directly while 
computing action values using backwards induction from the end of the episode 
rather than keeping confidence estimates for the transition probabilities. An 
issue with both of these improvements is that lower-order terms in the bounds 
mean they only hold for large n. It remains to be seen if these terms arise from 
the analysis or if the algorithms need modification. UCRL2 will fail in MDPs 
with infinite diameter, even if the learner starts in a subset of the states that 
is strongly connected from which it cannot escape. This limitation was recently 
overcome by Fruit et al. [2018], who provide an algorithm with roughly the same 
regret as UCRL2, but where the dependence on the diameter and state space are 
replaced with those of the sub-MDP in which the learner starts and from which 
it is assumed there is no escape. 

Tewari and Bartlett [2008] use an optimistic version of linear programming 
to obtain finite-time logarithmic bounds with suboptimal instance-dependent 
constants. Note this paper mistakenly drops some constants from the confidence
intervals, which after fixing would make the constants even worse, and it seems to
have other problems as well [Fruit et al., 2018]. Similar results are also available
for UCRL2 [Auer and Ortner, 2007]. Burnetas and Katehakis [1997a] prove
asymptotic guarantees with optimal constants, but with the crucial assumption
that the support of the next-state distributions $P_a(s)$ is known. Lai and Graves
[1997] also consider asymptotic optimality. However, they consider general state 
spaces where the set of transition probabilities is smoothly parameterised with a 
known parameterisation but under the weakened goal of competing with the best 
of finitely many memoryless policies given to the learner as black boxes. 

Finite-time regret bounds for large state and action space MDPs under additional
structural assumptions are also considered by Abbasi-Yadkori and Szepesvári
[2011], Abbasi-Yadkori [2012] and Ortner and Ryabko [2012]. Abbasi-Yadkori and
Szepesvári [2011] and Abbasi-Yadkori [2012] give algorithms with $O(\sqrt{n})$ regret
for linearly parameterised MDP problems with quadratic cost (linear quadratic
regulation, or LQR), while Ortner and Ryabko [2012] give $O(n^{(2d+1)/(2d+2)})$ regret
bounds under a Lipschitz assumption, where $d$ is the dimensionality of the state
space. The algorithms in these works are not guaranteed to be computationally
efficient because they rely on optimistic policies. In theory, this could be addressed
by Thompson sampling, which is considered by Abeille and Lazaric [2017b], who
obtain partial results for the LQR setting. Thompson sampling has also been
studied in the Bayesian framework by Osband et al. [2013], Abbasi-Yadkori
and Szepesvári [2015], Osband and Van Roy [2017] and Theocharous et al.
[2017], of which Abbasi-Yadkori and Szepesvári [2015] and Theocharous et al.
[2017] consider general parametrisations, while the other papers are concerned
with finite state-action MDPs. Learning in MDPs has also been studied in the
probably approximately correct (PAC) framework introduced by Kearns and
Singh [2002], where the objective is to design policies for which the number
of badly suboptimal actions is small with high probability. The focus of these
papers is on the discounted reward setting rather than average reward. The
algorithms are again built on the optimism principle. Algorithms that are known
to be PAC-MDP include R-max [Brafman and Tennenholtz, 2003, Kakade, 2003],
MBIE [Strehl and Littman, 2005, 2008], delayed Q-learning [Strehl et al., 2006],
the optimistic-initialisation-based algorithm of Szita and Lőrincz [2009], MorMax
by Szita and Szepesvári [2010], and an adaptation of UCRL by Lattimore and
Hutter [2012], which they call UCRL$\gamma$. The latter work presents optimal results
(matching upper and lower bounds) for the case when the transition structure
is sparse, while the optimal dependence on the number of state-action pairs
is achieved by delayed Q-learning and MorMax [Strehl et al., 2006, Szita and
Szepesvári, 2010], though the MorMax bound is better in its dependency on the
discount factor. The idea to incorporate the uncertainty in the transitions into
the action space to solve the optimistic optimisation problem appeared in the
analysis of MBIE [Strehl and Littman, 2008]. A hybrid between stochastic and
adversarial settings is when the reward sequence is chosen by an adversary, while
transitions are stochastic. This problem was introduced by Even-Dar et al.
[2004]. State-of-the-art results for the bandit case are due to Neu et al. [2014],
where the reader can also find further pointers to the literature. The case when
the rewards and the transition probability distributions are chosen adversarially
is studied by Abbasi-Yadkori et al. [2013].


Exercises 


38.1 (EXISTENCE OF PROBABILITY SPACE) Let $M = (\mathcal{S}, \mathcal{A}, P)$ be a finite
controlled Markov environment, which is a finite MDP without the reward
function. A policy $\pi = (\pi_t)_{t=1}^{\infty}$ is a sequence of probability kernels where $\pi_t$
is from $(\mathcal{S} \times \mathcal{A})^{t-1} \times \mathcal{S}$ to $\mathcal{A}$. Given a policy $\pi$ and initial state distribution
$\mu \in \mathcal{P}(\mathcal{S})$, show there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and an infinite sequence
of random elements $S_1, A_1, S_2, A_2, \ldots$ such that for any $s \in \mathcal{S}$, $a \in \mathcal{A}$ and $t \in \mathbb{N}$,


(a) $\mathbb{P}(S_1 = s) = \mu(s)$;
(b) $\mathbb{P}(S_{t+1} = s \mid S_1, A_1, \ldots, S_t, A_t) = P_{A_t}(S_t, s)$; and
(c) $\mathbb{P}(A_t = a \mid S_1, A_1, \ldots, S_t) = \pi_t(a \mid S_1, A_1, \ldots, S_{t-1}, A_{t-1}, S_t)$.


HINT Use Theorem 3.3. 


38.2 (SUFFICIENCY OF MARKOV POLICIES) Let $M = (\mathcal{S}, \mathcal{A}, P)$ be a finite
controlled Markov environment, $\pi$ be an arbitrary policy and $\mu \in \mathcal{P}(\mathcal{S})$ an
arbitrary initial state distribution. Denote by $\mathbb{P}^\pi_\mu$ the probability distribution that
results from the interconnection of $\pi$ and $M$, while the initial state distribution
is $\mu$.

(a) Show there exists a Markov policy $\pi'$ such that

$$\mathbb{P}^{\pi'}_\mu(S_t = s, A_t = a) = \mathbb{P}^{\pi}_\mu(S_t = s, A_t = a)$$

holds for all $t \ge 1$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$.
(b) Conclude that for any policy $\pi$ there exists a Markov policy $\pi'$ such that for
any $s \in \mathcal{S}$, $\bar\rho^{\pi'}_s = \bar\rho^{\pi}_s$.


HINT Define $\pi'$ inductively starting at $t = 1$. Puterman [2009, theorem 5.5.1]
proves this result and credits Strauch [1966].


38.3 (DETERMINISTIC POLICIES MINIMISE TRAVEL TIME) Let $P$ be some
transition structure over some finite state space $\mathcal{S}$ and some finite action space
$\mathcal{A}$. Show that the expected travel time between two states $s, s'$ of $\mathcal{S}$ is minimised
by a deterministic policy.


HINT Let $\tau^*(s, s')$ be the shortest expected travel time between some arbitrary
pair of states, which for $s = s'$ is defined to be zero. Show that $\tau^*$ satisfies the
fixed point equation

$$\tau^*(s, s') = \begin{cases} 0 & \text{if } s = s'\,; \\ 1 + \min_{a} \sum_{s''} P_a(s, s'')\, \tau^*(s'', s') & \text{otherwise.} \end{cases}$$
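The fixed point equation in the hint also suggests a way to compute travel times numerically. The sketch below iterates the equation on a hypothetical two-state, two-action transition structure; the array layout `P[a, s, s']`, the function name and the iteration count are assumptions made for illustration, and the code does not replace the argument the exercise asks for.

```python
import numpy as np

# Hypothetical transition structure: P[a, s, s'].
P = np.array([
    [[0.5, 0.5], [0.1, 0.9]],     # action 0
    [[0.75, 0.25], [0.2, 0.8]],   # action 1
])

def min_travel_times(P, target, num_iters=2000):
    """Iterate tau(s) <- 1 + min_a sum_s'' P[a, s, s''] tau(s''), with tau(target) = 0."""
    tau = np.zeros(P.shape[1])
    for _ in range(num_iters):
        tau = 1.0 + (P @ tau).min(axis=0)
        tau[target] = 0.0
    return tau

to_state_1 = min_travel_times(P, target=1)   # fastest: action 0 in state 0
to_state_0 = min_travel_times(P, target=0)   # fastest: action 1 in state 1
diameter = max(to_state_1[0], to_state_0[1])
```

Here the best action leaves state 0 with probability 1/2 and state 1 with probability 1/5, so the two travel times are 2 and 5 and the diameter of this toy structure is 5.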


38.4 (STRONGLY CONNECTED ⇔ FINITE DIAMETER) Let $M$ be a finite MDP.
Prove that $D(M) < \infty$ is equivalent to $M$ being strongly connected.


38.5 (DIAMETER LOWER BOUND) Let $M = (\mathcal{S}, \mathcal{A}, P, r)$ be any MDP. Show that
$D(M) \ge \log_A(S) - 3$.


HINT Denote by $d^*(s, s')$ the minimum expected time it takes to reach
state $s'$ when starting from state $s$. The definition of $d^*$ can be extended to
arbitrary initial distributions $\mu_0$ over states and sets $U \subset \mathcal{S}$ of target states:
$d^*(\mu_0, U) = \sum_{s} \mu_0(s) \sum_{s' \in U} d^*(s, s')$. Prove by induction on the size of $U$ that

$$d^*(\mu_0, U) \ge \min\left\{\sum_{k \ge 0} k\, n_k : 0 \le n_k \le A^k,\ k \ge 0,\ \sum_{k \ge 0} n_k = |U|\right\} \tag{38.25}$$

and then conclude that the proposition holds by choosing $U = \mathcal{S}$ [Jaksch et al.,
2010, corollary 15].


38.6 (STATE VISITATION PROBABILITIES AND CUMULATIVE REWARD) Let
$M = (\mathcal{S}, \mathcal{A}, P, r)$ be an MDP and $\pi$ a memoryless policy and $i, j \in [S]$.


(a) Show that $e_i^\top P_\pi^t e_j$ is the probability of arriving in state $j$ from state $i$ in $t$
rounds using policy $\pi$.

(b) Show that $e_i^\top \sum_{t=1}^{n} P_\pi^{t-1} r_\pi$ is the expected cumulative reward collected by
policy $\pi$ over $n$ rounds when starting in state $i$.


38.7 (STOCHASTIC MATRICES) Let $P$ be any $S \times S$ right stochastic matrix. Show
that the following hold:


(a) $A_n = \frac{1}{n} \sum_{t=0}^{n-1} P^t$ is right stochastic.

(b) $A_n + \frac{1}{n}(P^n - I) = A_n P = P A_n$.

(c) $P^* = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} P^t$ exists and is right stochastic.

(d) $P^* P = P P^* = P^* P^* = P^*$.

(e) The matrix $H = (I - P + P^*)^{-1}$ is well defined.

(f) Let $U = H - P^*$. Then $U = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \sum_{k=0}^{i-1} (P^k - P^*)$.

(g) Let $r \in \mathbb{R}^S$ and $\rho = P^* r$. Then $v = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \sum_{k=0}^{i-1} P^k (r - \rho)$ is well
defined and satisfies (38.3).

(h) With the notation of the previous part, $v + \rho = r + Pv$.


HINT Note that the first four parts of this exercise are the same as in Chapter 37.
For parts (c) and (d), you will likely find it useful that the space of right stochastic
matrices is compact. Then show that all cluster points of $(A_n)$ are the same. For
(g), show that $v = Ur$.


The previous exercise shows that the gain and differential value function of 
any memoryless policy in any MDP are well defined. The matrix H is called 
the fundamental matrix, and U is called the deviation matrix. 
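The objects in Exercise 38.7 can be computed numerically for a small chain, which is a useful check on the algebra. The sketch below is an illustration under stated assumptions: the periodic two-state chain `P` and reward vector `r` are hypothetical, and $P^*$ is approximated by the Cesàro average $A_n$ for a large even $n$. For this particular chain $A_n$ is exact for even $n$; in general the error is of order $1/n$.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def mat_add(A, B, sign=1.0):
    """Return A + sign * B."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def inverse(M):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, row)) + identity(n)[i] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for k in range(n):
            if k != col:
                factor = aug[k][col]
                aug[k] = [x - factor * y for x, y in zip(aug[k], aug[col])]
    return [row[n:] for row in aug]


def cesaro_average(P, n):
    """A_n = (1/n) * sum_{t=0}^{n-1} P^t."""
    size = len(P)
    power, total = identity(size), [[0.0] * size for _ in range(size)]
    for _ in range(n):
        total = mat_add(total, power)
        power = mat_mul(power, P)
    return [[x / n for x in row] for row in total]


# Hypothetical periodic chain: P^t itself does not converge, but A_n does.
P = [[0.0, 1.0], [1.0, 0.0]]
r = [1.0, 0.0]

P_star = cesaro_average(P, 1000)
H = inverse(mat_add(mat_add(identity(2), P, -1.0), P_star))  # fundamental matrix
U = mat_add(H, P_star, -1.0)                                 # deviation matrix
rho = mat_vec(P_star, r)                                     # gain
v = mat_vec(U, r)                                            # differential value

lhs = [vi + gi for vi, gi in zip(v, rho)]             # v + rho
rhs = [ri + pv for ri, pv in zip(r, mat_vec(P, v))]   # r + P v
print(P_star, U, lhs, rhs)
```

The periodic chain is chosen deliberately: it shows why part (c) is stated with a Cesàro average rather than a plain limit of $P^t$.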


38.8 (DISCOUNTED MDPs) Let $\gamma \in (0,1)$, and define the operator $T_\gamma : \mathbb{R}^S \to \mathbb{R}^S$
by


$$(T_\gamma v)(s) = \max_{a \in A} \left( r_a(s) + \gamma \langle P_a(s), v \rangle \right).$$

(a) Prove that $T_\gamma$ is a contraction with respect to the supremum norm:


$$\|T_\gamma v - T_\gamma w\|_\infty \leq \gamma \|v - w\|_\infty \quad \text{for any } v, w \in \mathbb{R}^S.$$

(b) Prove that there exists a $v \in \mathbb{R}^S$ such that $T_\gamma v = v$.

(c) Let $\pi$ be the greedy policy with respect to $v$. Show $v = r_\pi + \gamma P_\pi v$.

(d) Prove that $v = (I - \gamma P_\pi)^{-1} r_\pi$.

(e) Define the $\gamma$-discounted value function $v_\gamma^\pi$ of a policy $\pi$ as the function
that for any given state $s \in S$ gives the total expected discounted reward
of the policy when it is started from state $s$. Let $v_\gamma^* \in \mathbb{R}^S$ be defined by
$v_\gamma^*(s) = \max_\pi v_\gamma^\pi(s)$, $s \in S$. We call $\pi$ $\gamma$-discount optimal if $v_\gamma^* = v_\gamma^\pi$. Show
that if $\pi$ is greedy with respect to $v$ from part (b), then $\pi$ is a $\gamma$-discount optimal
policy.


HINT For (b), you should use the contraction mapping theorem (or Banach
fixed point theorem), which says that if $(X, d)$ is a complete metric space and
$T : X \to X$ satisfies $d(T(x), T(y)) \leq \gamma d(x, y)$ for $\gamma \in [0,1)$, then there exists an
$x \in X$ such that $T(x) = x$. For (e), use (d) and Exercise 38.2 to show that
it suffices to check that $v_\gamma^\pi \leq v$ for any Markov policy $\pi$. Verify this by using
the fact that $T_\gamma$ is monotone ($f \leq g$ implies that $T_\gamma f \leq T_\gamma g$) and showing that
$v_{\gamma,n}^\pi \leq T_\gamma^n \mathbf{0}$ holds for any $n$, where $v_{\gamma,n}^\pi(s)$ is the total expected discounted reward
of the policy when it is started from state $s$ and is followed for $n$ steps.
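The contraction property in part (a) is what makes repeated application of $T_\gamma$ converge to the fixed point of part (b). The sketch below illustrates this with plain value iteration on a hypothetical two-state, two-action MDP. The data layout (`rewards[a][s]`, `transitions[a][s]`), the stopping tolerance and the example MDP are assumptions made for the illustration, not part of the text.

```python
def bellman_operator(v, rewards, transitions, gamma):
    """(T_gamma v)(s) = max_a [ r_a(s) + gamma * <P_a(s), v> ]."""
    n_states, n_actions = len(v), len(rewards)
    return [
        max(
            rewards[a][s]
            + gamma * sum(transitions[a][s][t] * v[t] for t in range(n_states))
            for a in range(n_actions)
        )
        for s in range(n_states)
    ]


def sup_norm(u, w):
    return max(abs(x - y) for x, y in zip(u, w))


def value_iteration(rewards, transitions, gamma, tol=1e-10):
    """Iterate T_gamma from the zero vector until successive iterates are close."""
    v = [0.0] * len(rewards[0])
    while True:
        new_v = bellman_operator(v, rewards, transitions, gamma)
        if sup_norm(new_v, v) < tol:
            return new_v
        v = new_v


# Hypothetical MDP: action 0 stays, action 1 moves to the other state.
transitions = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [1.0, 0.0]],  # action 1
]
rewards = [[0.0, 1.0], [0.0, 0.0]]  # only staying in state 1 pays
gamma = 0.9

v_star = value_iteration(rewards, transitions, gamma)
print(v_star)

# Empirical check of the contraction inequality on two arbitrary vectors.
u, w = [3.0, -1.0], [0.0, 5.0]
gap_after = sup_norm(bellman_operator(u, rewards, transitions, gamma),
                     bellman_operator(w, rewards, transitions, gamma))
print(gap_after, gamma * sup_norm(u, w))
```

For this MDP the fixed point can be found by hand: staying in state 1 forever is worth 1 / (1 - 0.9) = 10, and moving there from state 0 is worth 0.9 * 10 = 9.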


38.9 (FROM DISCOUNTING TO AVERAGE REWARD) Recall that $H = (I - P + P^*)^{-1}$
and $U = H - P^*$. For $\gamma \in (0,1)$, define $P_\gamma^* = (1 - \gamma)(I - \gamma P)^{-1}$. Show that


(a) $\lim_{\gamma \to 1-} P_\gamma^* = P^*$;

(b) $\lim_{\gamma \to 1-} \frac{P_\gamma^* - P^*}{1 - \gamma} = U$.


HINT For (a) start by manipulating the expressions $P_\gamma^* P$ and $(P_\gamma^*)^{-1} P^*$. For
(b) consider $H^{-1}(P_\gamma^* - P^*)$.
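Both limits in Exercise 38.9 can be observed numerically on a two-state chain. The sketch below uses a hypothetical periodic chain for which $P^*$ and $U$ are easy to compute by hand (every entry of $P^*$ is 1/2, and $U$ has entries ±1/4). The helper names and the closed-form 2 x 2 inverse are choices made for this illustration.

```python
def inverse_2x2(M):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def discounted_matrix(P, gamma):
    """P*_gamma = (1 - gamma) * (I - gamma * P)^{-1} for a 2 x 2 matrix P."""
    M = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(2)]
         for i in range(2)]
    return [[(1.0 - gamma) * x for x in row] for row in inverse_2x2(M)]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical periodic chain with hand-computed P* and U = H - P*.
P = [[0.0, 1.0], [1.0, 0.0]]
P_star = [[0.5, 0.5], [0.5, 0.5]]
U = [[0.25, -0.25], [-0.25, 0.25]]


def gaps(gamma):
    """Distances to the two limits in parts (a) and (b) at a given gamma."""
    P_gamma = discounted_matrix(P, gamma)
    scaled = [[(P_gamma[i][j] - P_star[i][j]) / (1.0 - gamma) for j in range(2)]
              for i in range(2)]
    return max_abs_diff(P_gamma, P_star), max_abs_diff(scaled, U)


for gamma in (0.9, 0.99, 0.999):
    print(gamma, gaps(gamma))
```

Both gaps shrink as $\gamma$ approaches 1, in line with the view of the average-reward quantities as limits of discounted ones.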


38.10 (SOLUTION TO BELLMAN OPTIMALITY EQUATION) In this exercise you 
will prove part (a) of Theorem 38.2. 


(a) Prove there exists a deterministic stationary policy $\pi$ and increasing sequence
of discount rates $(\gamma_n)$ with $\gamma_n < 1$ and $\lim_{n\to\infty} \gamma_n = 1$ such that $\pi$ is a
greedy policy with respect to the fixed point $v_n$ of $T_{\gamma_n}$ for all $n$.

(b) For the remainder of the exercise, fix a policy $\pi$ whose existence is guaranteed
by part (a). Show that $\rho^\pi = \rho \mathbf{1}$ is constant.

(c) Let $v = v_\pi$ be the value function and $\rho = \rho^\pi$ the gain of policy $\pi$. Show that
$(\rho, v)$ satisfies the Bellman optimality equation.


HINT For (a), use the fact that for finite MDPs there are only finitely many 
memoryless deterministic policies. For (b) and (c), use Exercise 38.9. 


38.11 (COUNTERINTUITIVE SOLUTIONS TO THE BELLMAN EQUATION) Consider
the deterministic MDP shown below with two states and two actions. The first
action, STAY, keeps the state the same and the second action, GO, moves the
learner to the other state while incurring a reward of $-1$. Show that in this
example, solutions $(\rho, v)$ to the Bellman optimality equations (Eq. (38.5)) are
exactly the elements of the set


$$\{(\rho, v) \in \mathbb{R} \times \mathbb{R}^2 : \rho = 0,\ v(1) - 1 \leq v(2) \leq v(1) + 1\}.$$


[Figure: the two-state MDP. STAY loops on each state with reward r = 0; GO moves to the other state with reward r = -1.]
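Candidate solutions to the Bellman optimality equation for this two-state example are easy to test numerically, which helps build intuition for why the solution set is an interval rather than a single point (up to constants). The sketch below is illustrative: states are indexed from 0, and the function name and data layout are assumptions.

```python
def bellman_residual(rho, v, rewards, next_state):
    """Largest violation of rho + v(s) = max_a [ r_a(s) + v(next state) ].

    The MDP is deterministic, so next_state[a][s] is a single state.
    """
    worst = 0.0
    for s in range(len(v)):
        best = max(rewards[a][s] + v[next_state[a][s]] for a in range(len(rewards)))
        worst = max(worst, abs(rho + v[s] - best))
    return worst


# Action 0 is STAY (reward 0), action 1 is GO (reward -1), as in the exercise.
rewards = [[0.0, 0.0], [-1.0, -1.0]]
next_state = [[0, 1], [1, 0]]

print(bellman_residual(0.0, [0.0, 0.5], rewards, next_state))  # inside the set
print(bellman_residual(0.0, [0.0, 1.0], rewards, next_state))  # on the boundary
print(bellman_residual(0.0, [0.0, 1.5], rewards, next_state))  # outside the set
print(bellman_residual(0.1, [0.0, 0.0], rewards, next_state))  # wrong gain
```

A residual of zero means the pair solves the equation. The first two candidates have zero residual while the last two do not, which agrees with the set described in the exercise.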


38.12 (DANGERS OF LINEAR PROGRAM RELAXATION) Give an example of an 
MDP and a solution (p, v) to the linear program in Eq. (38.6) such that v does 
not satisfy the Bellman optimality equation and the greedy policy with respect 
to v is not optimal. 


38.13 (BOUND ON SPAN IN TERMS OF DIAMETER) Let M be a strongly connected
MDP and $(\rho, v)$ be a solution to the Bellman optimality equation. Show that
$\operatorname{span}(v) \leq (\rho^* - \min_{s,a} r_a(s)) D(M)$.


HINT Note that by Theorem 38.2, $\rho = \rho^*$. Fix some states $s_1 \neq s_2$ and a
memoryless policy $\pi$. Show that


$$v(s_2) - v(s_1) \leq \left(\rho^* - \min_{s,a} r_a(s)\right) \mathbb{E}^\pi[\tau_{s_2} \mid S_1 = s_1].$$


Note for the sake of curiosity that the above display continues to hold for weakly 
communicating MDPs. 


The proof of Theorem 4 in the paper by Bartlett and Tewari [2009] is 
incorrect. The problem is that the statement needs to hold for any solution 
v of the Bellman optimality equation. The proof uses an argument that 
hinges on the fact that in an aperiodic strongly connected MDP, v is in the 
set $\{c\mathbf{1} + \lim_{n\to\infty} T^n \mathbf{0} - n\rho^* : c \in \mathbb{R}\}$. However, Exercise 38.11 shows that
there exist strongly connected MDPs where this does not hold. 


38.14 (SEPARATION ORACLES) Solve the following problems: 


(a) Prove that Algorithm 27 provides a separation oracle for convex set K defined 
in Eq. (38.10). 

(b) Assuming that Algorithm 27 can be implemented efficiently, explain how to 
find an approximate solution to Eq. (38.7). 


38.15 (COMBINING SEPARATION ORACLES) Let $K \subset \mathbb{R}^d$ be a convex set and $\phi$ be
a separation oracle for $K$. Suppose that $a_1, \ldots, a_n$ is a collection of vectors with
$a_k \in \mathbb{R}^d$ and $b_1, \ldots, b_n$ is a collection of scalars. Let $H_k = \{x \in \mathbb{R}^d : \langle a_k, x \rangle \geq b_k\}$.
Devise an efficient separation oracle for $\bigcap_{k=1}^{n} K \cap H_k$.
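One natural construction for Exercise 38.15 is to query the oracles one after another and return the first separating direction found. The sketch below is a hedged illustration rather than a definitive answer. It assumes the following oracle convention, which may differ from the one used in the text: an oracle returns `None` when the query point lies in the set, and otherwise a vector $w$ with $\langle w, y \rangle > \langle w, x \rangle$ for every $y$ in the set. The unit-ball oracle standing in for $\phi$ and all function names are hypothetical.

```python
def dot(u, w):
    return sum(a * b for a, b in zip(u, w))


def halfspace_oracle(a, b):
    """Oracle for H = {x : <a, x> >= b}; a itself separates any x outside H."""
    def oracle(x):
        return None if dot(a, x) >= b else list(a)
    return oracle


def intersect_oracles(oracles):
    """Oracle for the intersection of the sets described by the given oracles.

    A direction separating x from one of the sets also separates x from the
    intersection, so the first violation found can be returned directly.
    """
    def oracle(x):
        for phi in oracles:
            w = phi(x)
            if w is not None:
                return w
        return None
    return oracle


def unit_ball_oracle(x):
    """Hypothetical stand-in for the oracle of K: the Euclidean unit ball."""
    if dot(x, x) <= 1.0:
        return None
    return [-xi for xi in x]  # pointing from x back towards the ball


combined = intersect_oracles([unit_ball_oracle, halfspace_oracle([1.0, 0.0], 0.5)])
print(combined([0.6, 0.0]))  # inside both sets
print(combined([0.0, 0.0]))  # violates the half-space
print(combined([2.0, 0.0]))  # violates the ball
```

Under these assumptions one query costs a single call to the oracle of $K$ plus at most $n$ inner products in dimension $d$.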
38.16 (APPROXIMATE SOLUTIONS TO BELLMAN EQUATION) Consider a strongly
connected MDP, and suppose that $\rho$ and $v$ approximately satisfy the Bellman
optimality equation in the sense that there exists an $\varepsilon > 0$ such that


$$\left| \rho + v(s) - \max_{a \in A} \left( r_a(s) + \langle P_a(s), v \rangle \right) \right| \leq \varepsilon \quad \text{for all state-action pairs } s, a. \tag{38.26}$$


(a) Show that $\rho \geq \rho^* - \varepsilon$.

(b) Let $\tilde\pi$ be the greedy policy with respect to $v$. Assume that $\tilde\pi$ is $\varepsilon'$-greedy
with respect to $v$ in the sense that $r_{\tilde\pi(s)}(s) + \langle P_{\tilde\pi(s)}(s), v \rangle \geq
\max_{a \in A} r_a(s) + \langle P_a(s), v \rangle - \varepsilon'$ holds for all $s \in S$. Show that $\tilde\pi$ is $(2\varepsilon + \varepsilon')$-optimal:
$\rho^{\tilde\pi} \geq \rho^* - (2\varepsilon + \varepsilon')$.

(c) Suppose that $\rho^*$ in Eq. (38.7) is replaced with $\rho \in [\rho^*, \rho^* + \delta]$. Show that the
linear program remains feasible and the solution $(\rho, v)$ satisfies Eq. (38.26)
with $\varepsilon \leq |S|^2 \delta$.


38.17 (AVERAGE-OPTIMAL IS NEARLY FINITE-TIME OPTIMAL) Let M be a
strongly connected MDP with rewards in [0,1], diameter $D < \infty$ and optimal
gain $\rho^*$. Let $v_n^*(s)$ be the maximum total expected reward in $n$ steps when the
process starts in state $s$. Prove that $v_n^*(s) \leq n\rho^* + D$.


38.18 (HIGH PROBABILITY $\Rightarrow$ EXPECTED REGRET) Prove that (38.12) follows
from Theorem 38.6. 


38.19 (NECESSITY OF PHASES) The purpose of this exercise is to show that 
without phases, UCRL2 may suffer linear regret. For convenience, we consider the 
modified version of UCRL2 in Exercise 38.23 that does not know the reward. Now 
suppose we further modify this algorithm to re-solve the optimistic MDP in every 
round ($\tau_k = k$ for all $k$). We make use of a two-state deterministic MDP with
two actions A = {STAY, GO}, depicted in Fig. 38.4. The transitions underlying
the two actions are represented by dashed and solid arrows, respectively.


[Diagram of the two-state MDP, with reward labels 1/2 and 1/2.]

Figure 38.4 Transitions and rewards are deterministic. Numbers indicate the rewards.


(a) Find all memoryless optimal policies for the MDP in Fig. 38.4. 
(b) Prove that the version of UCRL2 given in Exercise 38.23 modified to re-solve 
the optimistic MDP in every round suffers linear regret on this MDP. 


HINT Since UCRL2 and the environment are both deterministic, you can
examine the behaviour of the algorithm on the MDP. You should aim to prove
that eventually the algorithm will alternate between actions STAY and GO.


[Long-term plans should have phases] UCRL2, and more generally optimistic
algorithms without explicit phases, fail because the plan is obtained by solving
an infinite horizon problem. In that problem, a reward in another state that is
larger by the tiniest amount than the reward in the current state makes it
worthwhile to switch to the other state. If we considered a finite horizon version
of the problem, where experience is collected in episodes with some fixed start
state or start state distribution, an optimistic algorithm would eventually stop
considering switches. On a finite horizon, the loss from using the actions that
switch would eventually be assessed to be higher than the potential gain from
switching.


38.20 (EXTENDED MDP IS STRONGLY CONNECTED) Let $\tilde M_k$ be the extended
MDP defined in Section 38.5.1 and $C_{\tau_k}$ be the confidence set defined in Eq. (38.13).
Prove that $P \in C_{\tau_k}$ implies that $\tilde M_k$ is strongly connected.


38.21 (CONFIDENCE SETS) Prove Lemma 38.8. 


HINT Use the result of Exercise 5.17 and apply a union bound over all state- 
action pairs and the number of samples. Use the Markov property to argue that 
the independence assumption in Exercise 5.17 is not problematic. 


38.22 Let $(a_k)$ and $(A_k)$ be non-negative numbers so that for any $k \geq 0$,
$a_{k+1} \leq A_k = 1 \vee (a_1 + \cdots + a_k)$. Prove that for any $m \geq 1$,


$$\sum_{k=1}^{m} \frac{a_k}{\sqrt{A_{k-1}}} \leq (\sqrt{2} + 1) \sqrt{A_m}.$$


HINT The statement is trivial if $\sum_{k=1}^{m-1} a_k \leq 1$. If this does not hold, use
induction based on $m = n, n+1, \ldots$, where $n$ is the first integer such that
$\sum_{k=1}^{n-1} a_k > 1$.
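Before proving the inequality in Exercise 38.22, it can be reassuring to test it on random sequences that satisfy the constraint $a_{k+1} \leq A_k$. The sketch below does this with the standard library only. The sampling scheme (each $a_k$ uniform on $[0, A_{k-1}]$), the sequence length and the function names are assumptions chosen for illustration, and passing such a test is of course not a proof.

```python
import math
import random


def inequality_sides(a):
    """Return (lhs, rhs) of sum_k a_k / sqrt(A_{k-1}) <= (sqrt(2) + 1) sqrt(A_m)."""
    prefix, lhs = 0.0, 0.0
    for a_k in a:
        A_prev = max(1.0, prefix)  # A_{k-1} = 1 v (a_1 + ... + a_{k-1})
        if not 0.0 <= a_k <= A_prev:
            raise ValueError("sequence violates a_k <= A_{k-1}")
        lhs += a_k / math.sqrt(A_prev)
        prefix += a_k
    rhs = (math.sqrt(2.0) + 1.0) * math.sqrt(max(1.0, prefix))
    return lhs, rhs


def random_sequence(m, rng):
    """Sample a_1, ..., a_m with a_k uniform on [0, A_{k-1}]."""
    a, prefix = [], 0.0
    for _ in range(m):
        a_k = rng.uniform(0.0, max(1.0, prefix))
        a.append(a_k)
        prefix += a_k
    return a


rng = random.Random(0)
worst_ratio = 0.0
for _ in range(200):
    lhs, rhs = inequality_sides(random_sequence(50, rng))
    worst_ratio = max(worst_ratio, lhs / rhs)
print(worst_ratio)  # stays below 1 if the inequality holds on every sample

# The extreme case a_k = A_{k-1} roughly doubles the running sum in each step.
print(inequality_sides([1.0, 1.0, 2.0, 4.0, 8.0]))
```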

38.23 (UNKNOWN REWARDS) In this exercise, you will modify the algorithm
to handle the situation where $r$ is unknown and rewards are stochastic. More
precisely, assume there exists a function $r_a(s) \in [0,1]$ for all $a \in A$ and $s \in S$.
Then, in each round, the learner observes $S_t$, chooses an action $A_t$ and receives a
reward $X_t \in [0,1]$ with


$$\mathbb{E}[X_t \mid A_t, S_t] = r_{A_t}(S_t).$$


In order to accommodate the unknown reward function, we modify UCRL2 in
the following way. First, define the empirical reward at the start of the kth phase
by


$$\hat r_{k,a}(s) = \sum_{u=1}^{\tau_k - 1} \frac{\mathbb{I}\{S_u = s, A_u = a\} X_u}{1 \vee T_{\tau_k - 1}(s, a)}.$$


Then, let $\tilde r_{k,a}(s)$ be an upper confidence bound given by


$$\tilde r_{k,a}(s) = \hat r_{k,a}(s) + \sqrt{\frac{L}{2(1 \vee T_{\tau_k - 1}(s, a))}},$$


where $L$ is as in the proof of Theorem 38.6. The modified algorithm operates
exactly like Algorithm 28, but replaces the unknown $r_a(s)$ with $\tilde r_{k,a}(s)$ when
solving the extended MDP. Prove that with probability at least $1 - 3\delta/2$, the
modified policy in the modified setting has regret at most


$$\hat R_n \leq C D(M) S \sqrt{n A \log\left(\frac{nSA}{\delta}\right)},$$


where $C > 0$ is a universal constant.


38.24 (LOWER BOUND) In this exercise, you will prove the claims to complete 
the proof of the lower bound. 


(a) Prove Claim 38.9. 
(b) Prove Claim 38.10. 
(c) Prove Claim 38.11. 


38.25 (CONTEXTUAL BANDITS AS MDPs) Consider the MDP M = (S, A, P, r),
where $P_a(s) = p$ for some fixed categorical distribution $p$ for any $(s, a) \in S \times A$,
where $\min_{s \in S} p(s) > 0$. Assume that the rewards for action $a$ in state $s$ are
sampled from a distribution supported on [0,1] (see Note 3). An MDP like this
defines nothing but a contextual bandit.


(a) Derive the optimal policy and the average optimal reward.

(b) Show an optimal value function that solves the Bellman optimality equation.

(c) Prove that the diameter of this MDP is $D = \max_s 1/p(s)$.

(d) Consider the algorithm that puts one instance of an appropriate version
of UCB into every state (the same idea was explored in the context of
adversarial bandits in Section 18.1). Prove that the expected regret of your
algorithm will be at most $O(\sqrt{SAn})$.

(e) Does the scaling behaviour of the upper bound in Theorem 38.6 match the
actual scaling behaviour of the expected regret of UCRL2 in this example?
Why or why not?

(f) Design and run an experiment to confirm your claim.
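The per-state idea in part (d) of Exercise 38.25 can be prototyped in a few lines as a starting point for the experiment in part (f). The sketch below is a minimal illustration under several assumptions that are not from the text: rewards are Bernoulli, the index is the classical UCB1 form (empirical mean plus $\sqrt{2\log(\text{visits})/\text{pulls}}$), the first state is also drawn from $p$, and the instance (`p`, `means`) is hypothetical. It tracks pseudo-regret against the best action in each visited state.

```python
import math
import random


def run_per_state_ucb(p, means, n, seed=0):
    """Run one UCB learner per state; return (pseudo-regret, pull counts)."""
    rng = random.Random(seed)
    n_states, n_actions = len(means), len(means[0])
    counts = [[0] * n_actions for _ in range(n_states)]
    sums = [[0.0] * n_actions for _ in range(n_states)]
    pseudo_regret = 0.0
    for _ in range(n):
        # The state is drawn from p regardless of the previous state and action.
        s = rng.choices(range(n_states), weights=p)[0]
        visits = sum(counts[s])
        untried = [a for a in range(n_actions) if counts[s][a] == 0]
        if untried:
            a = untried[0]  # play every action once in each state first
        else:
            def index(b):
                bonus = math.sqrt(2.0 * math.log(visits) / counts[s][b])
                return sums[s][b] / counts[s][b] + bonus
            a = max(range(n_actions), key=index)
        reward = 1.0 if rng.random() < means[s][a] else 0.0  # Bernoulli reward
        counts[s][a] += 1
        sums[s][a] += reward
        pseudo_regret += max(means[s]) - means[s][a]
    return pseudo_regret, counts


# Hypothetical instance: three states, two actions.
p = [0.7, 0.2, 0.1]
means = [[0.9, 0.5], [0.2, 0.6], [0.5, 0.4]]
regret, counts = run_per_state_ucb(p, means, n=5000)
print(round(regret, 1), counts)
```

Repeating the run for several horizons and seeds and plotting the regret against $\sqrt{n}$ gives a rough empirical view of the scaling discussed in parts (d) and (e).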


38.26 (IMPLEMENTATION) This is a thinking and coding exercise to illustrate
the difficulty of learning in MDPs. The RiverSwim environment is originally due to
Strehl and Littman [2008]. The environment has two actions A = {LEFT, RIGHT}
and state space $\mathcal{S} = [S]$ with $S \geq 2$. In all states $s > 1$, action LEFT deterministically leads
to state $s - 1$ and provides no reward. In state 1, action LEFT leaves the state
unchanged and yields a reward of 0.05. The action RIGHT tends to make the agent
move right but not deterministically (the learner is swimming against a current).
With probability 0.3 the state is incremented, with probability 0.6 the state
is left unchanged and with probability 0.1 the state is decremented. This
action incurs a reward of zero in all states except in state $S$, where it receives a
reward of 1. The situation when $S = 5$ is illustrated in Fig. 38.5.


Figure 38.5 The RiverSwim MDP when S = 5. Solid arrows correspond to action LEFT
and dashed ones to action RIGHT. The right-hand bank is slippery, so the learner
sometimes falls back into the river.


(a) Show that the optimal policy always takes action RIGHT and calculate the
optimal average reward $\rho^*$ as a function of $S$.

(b) Implement the MDP and test the optimal policy when started from state 1.
Plot the total reward as a function of time and compare it with the plot of
$t \mapsto t\rho^*$. Run multiple simulations to produce error bars. How fast do you
think the total reward concentrates around $t\rho^*$? Experiment with different
values of $S$.

(c) The $\varepsilon$-greedy strategy can also be implemented in MDPs as follows: based
on the data previously collected, estimate the transition probabilities and
rewards using empirical means. Find the optimal policy $\pi^*$ of the resulting
MDP, and if the current state is $s$, use the action $\pi^*(s)$ with probability $1 - \varepsilon$
and choose one of the two actions uniformly at random with the remaining
probability. To ensure the empirical MDP has a well-defined optimal policy,
mix the empirical estimate of the next state distributions $P_a(s)$ with the
uniform distribution with a small mixture coefficient. Implement this strategy
and plot the trajectories it exhibits for various MDP sizes. Explain what you
see.

(d) Implement UCRL2 and produce the same plots. Can you explain what you
see?

(e) Run simulations in RiverSwim instances of various sizes to compare the
regret of UCRL2 and $\varepsilon$-greedy. What do you conclude?
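As a possible starting point for part (b), the sketch below builds the RiverSwim transition probabilities and simulates a fixed policy. It rests on assumptions that should be checked against Fig. 38.5. States are indexed 0 to S - 1 rather than 1 to S. The text does not spell out what RIGHT does at the two banks, so here any move that would leave the state space is clipped, which adds its probability to staying put. Rewards are treated as deterministic functions of the state-action pair. All names are illustrative.

```python
import random

LEFT, RIGHT = 0, 1


def riverswim_transitions(S):
    """P[a][s] is the next-state distribution over states 0..S-1."""
    P = [[[0.0] * S for _ in range(S)] for _ in range(2)]
    for s in range(S):
        P[LEFT][s][max(s - 1, 0)] = 1.0
        # ASSUMPTION: moves that would leave {0, ..., S-1} are clipped.
        for step, prob in ((+1, 0.3), (0, 0.6), (-1, 0.1)):
            P[RIGHT][s][min(max(s + step, 0), S - 1)] += prob
    return P


def riverswim_reward(S, s, a):
    """0.05 for LEFT in the first state, 1 for RIGHT in the last, else 0."""
    if a == LEFT and s == 0:
        return 0.05
    if a == RIGHT and s == S - 1:
        return 1.0
    return 0.0


def simulate(S, policy, n, seed=0):
    """Follow policy (a map from state to action) for n rounds from state 0.

    Returns the list of cumulative rewards after each round.
    """
    rng = random.Random(seed)
    P = riverswim_transitions(S)
    s, total, path = 0, 0.0, []
    for _ in range(n):
        a = policy(s)
        total += riverswim_reward(S, s, a)
        path.append(total)
        s = rng.choices(range(S), weights=P[a][s])[0]
    return path


def always_right(s):
    return RIGHT


path = simulate(5, always_right, 2000)
print(path[-1] / len(path))  # empirical average reward of always going RIGHT
```

Running `simulate` with several seeds and plotting the resulting paths against $t \mapsto t\rho^*$ gives the error bars requested in part (b).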
38.27 (UCRL2 AND UNREACHABLE STATES) Show that UCRL2 suffers linear
regret if there is a single unreachable state with reward larger than the optimal 
gain. 


Hint Think about the optimistic MDP and the optimistic transitions to the 
unreachable state. The article by Fruit et al. [2018] provides a policy that mitigates 
the problem. 


38.28 (MDPS WITH TRAPS (I)) Fix state space S, action space A and reward
function r. Let $\pi$ be a policy with sublinear regret in all strongly connected
MDPs (S, A, r, P). Now suppose that (S, A, r, P) is an MDP that is not strongly
connected such that for all $s \in S$, there exists a state $s'$ that is reachable from $s$
under some policy and where $\rho^*_{s'} < \max_u \rho^*_u$. Finally, assume that $\rho^*_{S_1} = \max_u \rho^*_u$
almost surely. Prove that $\pi$ has linear regret on this MDP.


38.29 (MDPS WITH TRAPS (II)) This exercise develops the ideas mentioned in
Note 14. First, we need some definitions: fix S and A and define $\Pi_0$ as the set
of policies (learner strategies) for MDPs with state space S and action space A
that achieve sublinear regret in any strongly connected MDP with state space S
and action space A. Now consider an arbitrary finite MDP M = (S, A, P, r). A
state $s \in S$ is reachable from state $s' \in S$ if there is a policy that when started
in $s'$ reaches state $s$ with positive probability after one or more steps. A set of
states $C \subset S$ is a strongly connected component (SCC) if every state $s \in C$
is reachable from every other state $s' \in C$, including $s = s'$. A set $C \subset S$ is
maximal if we cannot add more states to $C$ and still maintain the SCC property.
An SCC $C$ is called a maximal end component (MEC) if there does not exist another
SCC $C'$ with $C \subset C'$. Show the following:


(a) There exists at least one MEC, and two MECs $C_1$ and $C_2$ are either equal
or disjoint.

(b) Let $C_1, \ldots, C_k$ be all the distinct MECs of an MDP. The MDP structure
defines a connectivity over $C_1, \ldots, C_k$ as follows: for $i \neq j$, we say that $C_i$ is
connected to $C_j$ if from some state in $C_i$, it is possible to reach some state of
$C_j$ with positive probability under some policy. Show that this connectivity
structure defines a directed graph, which must be acyclic.

(c) Let $C_1, \ldots, C_m$ with $m \leq k$ be the sinks (the nodes with no out edges) of
this graph. Show that if M is strongly connected, then $m = 1$ and $C_1 = S$.

(d) Show that for any $i \in [m]$ and for any policy $\pi \in \Pi_0$, it holds that $\pi$ will
reach $C_i$ in finite time with positive probability if the initial state distribution
assigns positive mass to the non-trap states $S \setminus \cup_{j \in [m]} C_j$.

(e) Show that for $i \leq m$, for any $s \in C_i$ and any action $a \in A$, $P_a(s, s') = 0$ for
any $s' \in S \setminus C_i$, i.e., $C_i$ is closed.

(f) Show that the restriction of M to $C_i$ defined as


$$M_i = (C_i, A, (P_a(s))_{s \in C_i, a \in A}, (r_a(s))_{s \in C_i, a \in A})$$

is an MDP.

(g) Show that $M_i$ is strongly connected.

(h) Let $\tau$ be the time when the learner enters one of $C_1, \ldots, C_m$ and let $I \in [m]$
be the index of the class that is entered at time $\tau$. That is, $S_\tau \in C_I$. Show
that if M is strongly connected, then $\tau = 1$ with probability one.

(i) We redefine the regret as follows:


$$R'_n = \mathbb{E}\left[\sum_{t=\tau}^{\tau+n-1} r_{A_t}(S_t) - n\rho^*(M_I)\right].$$


Show that if M is strongly connected, then $R'_n = R_n$.

(j) Can you design a policy with $R'_n = O(\mathbb{E}[D(M_I)|C_I|]\sqrt{An\log(n)})$? Will
UCRL2 already satisfy this?


The logic of the regret definition in part (i) is that by part (d), reasonable
policies cannot control which trap they fall into in an MDP that has more
than one trap. As such, policies should not be penalised for which trap they
fall into. However, once a policy falls into some trap, we expect it to start
to behave near optimally. What this definition is still lacking is that it is
insensitive to how fast a policy gets trapped. The last part is quite subtle
[Fruit et al., 2018].


38.30 (CHAIN RULE FOR RELATIVE ENTROPY) Prove the claim in Eq. (38.22). 


HINT Make use of the result in Exercise 14.13. 
Borel space, 46
boundary, 314 
Bregman divergence, 308—310, 328, 
437 
Bretagnolle-Huber inequality, 190, 201, 
209, 217, 289, 407 
Bretagnolle-Huber-Carol inequality, 88 
Brownian motion, 130, 386 


canonical bandit model, 198 
k-armed stochastic, 63 


Bayesian, 438 
contextual, 70 
infinite-armed stochastic, 66 
Carathéodory’s extension theorem, 25 
cardinal optimisation, 417 
cascade model, 390 
categorical distribution, 88, 488, 527 
Catoni’s estimator, 116 
cell decomposition, 483 
Cesaro sum, 517, 534 
chain rule of probability measures, 53 
change of measure, 39 
Chernoff bound, 134, 399 
chi-squared distance, 193, 204 
chi-squared distribution, 81, 242 
click model, 389 
closed function, 376 
closed set, 314 
complement, 21 
concave, 308 
conditional expectation, 33 
conditional independence, 37 
conditional probability, 28 
conjugate pair, 427 
conjugate prior, 426, 427 
consistent policy, 207, 296 
contextual bandit, 70, 224, 383 
adversarial linear, 354 
stochastic, 232, 235, 238—240 
stochastic linear, 240—245, 354 
contextual partial monitoring, 508 
controlled Markov environment, 538 
convex hull, 306 
convex optimisation, 327 
core set, 268 
counting measure, 40 
covering, 255, 263 
Cramér transform, 83, 84 
Cramér—Chernoff method, 77—79, 168, 
257, 258 
cumulant generating function, 77, 142 
cumulative distribution function, 33 


D-optimal design, 268 
data processing inequality, 196, 490
degenerate action, 483 
density, 39 
derivative-free stochastic optimisation, 
416 

descriptive theory, 67 
deviation matrix, 540 
diameter 

of convex set, 334 

of MDP, 515 
differential value function, 517 
discount factor, 384 
discounting, 384, 448, 455 
disintegration theorem, 53, 435, 436 
distribution, 21 
Dobrushin’s theorem, 192, 196 
domain, 306 
dominated action, 483 
dominating measure, 40 
doubling trick, 95, 97, 159, 287, 341 
dual norm, 335 
dynamic programming, 455 


easy partial monitoring game, 483 

effective dimension, 251 

elimination algorithm, 95, 96, 98, 123, 
211, 273, 278 

elliptical potential lemma, 243 

empirical risk minimisation, 233 

entropy, 186, 187, 194 

events, 19 

exchangeable, 237 

Exp3, 152, 225, 319, 321, 327, 345, 364, 
378, 380, 482 

Exp3-IX, 166, 215, 364, 386 

Exp3.P, 173, 234, 386 

Exp3.S, 386 

Exp4, 229, 249, 379 

expectation, 30 

explore-then-commit, 91, 109, 232, 248 

exponential family, 121, 141, 211, 212, 
408, 428, 428, 471 

exponential weighting, 152, 367 

algorithm, 158 


continuous, 321-323 
extended real line, 306 


feasible, 519 
feature map, 239 
feature space, 240 
feature vector, 240 
feedback matrix, 480 
Fenchel dual, 84, 257, 307 
filtered probability space, 28 
filtration, 27 
finite additivity, 21 
first-order bound, 171, 341, 347 
first-order optimality condition, 312 
Fisher information, 202 
fixed design, 254 
fixed share, 385 
follow the leader, 328, 342 
follow-the-perturbed-leader, 163, 232, 
367, 463 
follow-the-regularised-leader, 328, 355, 
356 
changing potentials, 346 
Frank—Wolfe algorithm, 270, 272 
Fubini’s theorem, 40 
full information, 158, 385 
fundamental matrix, 540 
G-optimal design, 268 
gain, 516 
game theory, 183 
Gaussian distribution, 39 
Gaussian tail 
lower bound, 475 
upper and lower bounds, 184 
upper bound, 76 
generalised linear bandit, 250 
generalised linear model, 247 
Gittins index, 385, 447 
globally observable, 486 
gradient descent, 329 
graph Laplacian, 252 


Hahn decomposition, 31 
hard partial monitoring game, 483 
Hardy-Littlewood, 451, 458, 459
heavy tailed, 77

Hedge, 157 

Hellinger distance, 193 

Hoeffding’s inequality, 80, 136, 399 

Hoeffding’s lemma, 80, 84, 135, 158, 
264 

Hoeffding—Azuma, 264, 529 

hopeless partial monitoring game, 483 

Huffman coding, 187 

hypercube, 278, 289, 323, 340 

hypothesis space, 424 

hypothesis testing, 196 

image, 503 

implicitly normalised forecaster, 348 

importance-weighted estimator, 150, 
151, 171, 232, 338 

independent events, 29 

index, 447 

index policy, 447 

indicator function, 23 

information-directed sampling, 
473, 476, 508 

instance-dependent bound, 233 

integrable, 31 

interior, 314 

Ionescu-Tulcea theorem, 48, 66, 431,
514 

isomorphic measurable spaces, 46 


247, 


Jensen’s inequality, 308 
John’s ellipsoid, 324 


Kearns-Saul inequality, 85 

kernel, 503 

kernel trick, 245 

Kiefer—Wolfowitz, 268, 275, 321, 324, 
337, 355 

Kraft’s inequality, 191 

Kullback—Leibler divergence, 186 


Laplace’s method, 258 

large deviation theory, 83 

law, 22 

law of the iterated logarithm, 112, 124, 
265 

law of total expectations, 36 


Le Cam’s inequality, 191 
Le Cam’s method, 202, 203 
learning rate, 152 
adaptive, 341, 347 
time-varying, 231, 329, 346, 347 
least-squares, 254 
Lebesgue integral, 30 
Lebesgue measure, 32 
Legendre function, 310—312, 328, 367 
light tailed, 77 
likelihood ratio, 266 
linear subspace, 503 
link function, 247 
Lipschitz bandit, 249 
locally observable, 486 
log partition function, 428 
log-concave, 322 
logistic function, 428 
loss matrix, 480 
LOTUS, 33 
margin, 250 
Markov chain, 48—49, 514 
Markov kernel, 48 
Markov policy, 514 
Markov process, 52 
Markov property, 533 
Markov reward process, 447 
martingale, 49 
maximal end component, 547 
maximal inequality, 51, 124, 260 
measurable map, 22 
measurable set, 21 
measurable space, 21 
measure, 21 
median-of-means, 115 
memoryless deterministic policy, 514 
memoryless policy, 514 
metric entropy, 263 
minimax, 123 
minimax optimal, 180 
mirror descent, 158, 328, 367, 380 
misspecified linear bandit, 275, 293, 
355, 356 
model, 424

MOSS, 121, 123 

multi-class classification with bandit 
feedback, 234 

multi-task bandit, 292, 364, 376 

multinomial distribution, 88 


nats, 188 

negative correlation, 169 
neighbouring actions, 484 
Neyman-Pearson lemma, 192, 196 
non-anticipating sequence, 346 
non-oblivious, 158, 341 
non-parametric, 58 

non-singular exponential family, 428 
non-stationary bandit, 68 
nonstationary, 159 

null set, 35 


oblivious, 335, 341 

oblivious adversary, 158 

one-armed bandit, 10, 71, 121 
Bayesian, 443—447 

online gradient descent, 329 

online learning, 17, 283, 327 

online linear optimisation, 327 

online-to-confidence set conversion, 282 

open set, 314 

operator, 41 

optimal experimental design, 268, 414 

optimal value function, 517 

optimisation oracle, 232, 372 

optimism bias, 110 

optional stopping theorem, 50 

ordinal optimisation, 417 

orthogonal complement, 503 

outcome space, 19 


packing, 263 

parameter noise, 352 

parametric, 58 

Pareto optimal, 184, 421 

Pareto optimal action, 483 

partial monitoring, 13 

partially observable Markov decision 
process, 533 


peeling device, 124 

permutation, 388 

Pinsker’s inequality, 134, 141, 193, 194, 
334 

point-locally observable, 505 

Poisson distribution, 82 

policy, 64 

policy iteration, 535 

policy schema, 434 

position-based model, 390 

posterior, 423 

potential function, 328 

predictable, 27 

predictable variation, 86, 87 

prediction with expert advice, 158 

preimage, 20 

prescriptive theory, 67 

prior, 421, 424 

prior variance, 427 

probability distribution, 21 

probability kernel, 48, 424 

probability measure, 21 

probability space, 21 

product σ-algebra, 25

product kernel, 48 

product measure, 40, 65 

projective, 47 

push-forward, 22 


quadratic variation, 171 


Rademacher variable, 83 
Radon-Nikodym derivative, 39 
random 

element, 22 

variable, 22 

vector, 22 
random table model, 65, 69, 150 
random variable, 19 
ranked bandit model, 399 
ranking and selection, 416 
rate function, 84 
reactive adversary, 158 
reduction, 351, 404 
regret, 10 
adversarial, 148

non-stationary, 379 

policy, 158 

pseudo, 68 

pseudo, random, 216 

random, 68 

stochastic, 60 

tracking, 379 
regret decomposition lemma, 62 
regular exponential family, 428 
regular version, 52, 196 
regularised risk minimisation, 344 
regulariser, 328 
reinforcement learning, 13, 96, 455, 523 
relative entropy, 188—191, 310, 437 
restless bandit, 385, 457 
retirement policy, 71 
reward-stack model, 65, 69, 451 
ridge regression, 254 
right stochastic matrix, 512 


semi-bandit, 362, 365—372, 400, 402 

semibandit, 474 

separation oracle, 322, 521 

sequential halving, 413 

Sherman-Morrison formula, 253 

signal variance, 427 

signal-to-noise ratio, 353 

signed measure, 21 

similarity function, 227 

simple function, 31 

Sion’s minimax theorem, 343, 349, 434, 
473 

sliding window, 384 

smoothness, 240 

source coding theorem, 187 

span, 517 

spectral bandit, 251 

state space, 512 

static experts, 232 

stationary transition matrix, 517, 540 

stochastic optimisation, 413 

stochastic process, 47 

stopping rule, 50 
stopping time, 50

strictly convex, 307

zeroth-order stochastic optimisation, 416

strongly connected component, 547 

sub-σ-algebra, 21

submartingale, 49, 124 

suboptimality gap, 62 

sufficient statistic, 435, 444, 450, 533 
of exponential family, 428 

supermartingale, 49, 262 

supervised learning, 227 

support, 41 

support function, 368, 375 

supporting hyperplane, 312 
theorem, 315, 438 

tail probability, 33 

Thompson sampling, 68, 121, 460 
for reinforcement learning, 537 

total variation distance, 193, 193 

tower rule, 36 

track-and-stop algorithm, 410 

transductive learning, 234 

transition matrix, 516 

trivial event, 44 

trivial partial monitoring game, 483 


UCB-V, 115 

uniform exploration algorithm, 403 

union bound, 78 

unit ball, 4, 248, 263, 290, 334, 337, 342, 
345, 350, 353, 355 

unit sphere, 287, 291 

universal constant, 4 

unnormalised negentropy, 310, 311, 
316, 330, 345, 365, 380, 470 

unstructured bandits, 207 


value function, 442 

vanishing discount approach, 534 
Varaiya’s algorithm, 456 

VC dimension, 236 

von Neumann-Morgenstern theorem, 67 
Wald—Bellman equation, 442 

weak neighbour, 505 

weak* topology, 41 

worst-case regret, 180 


